Abstract

Clustering,  Dimensionality  Reduction,  and  Side  Information 

By 

Hiu  Chung  Law 


Recent advances in sensing and storage technology have created many high-volume,
high-dimensional data sets in pattern recognition, machine learning, and data mining. Unsupervised
learning  can  provide  generic  tools  for  analyzing  and  summarizing  these  data  sets  when  there  is  no 
well-defined  notion  of  classes.  The  purpose  of  this  thesis  is  to  study  some  of  the  open  problems 
in  two  main  areas  of  unsupervised  learning,  namely  clustering  and  (unsupervised)  dimensionality 
reduction. Instance-level constraints on objects, an example of side-information, are also considered
to improve the clustering results.

Our  first  contribution  is  a  modification  to  the  isometric  feature  mapping  (ISOMAP)  algorithm 
when  the  input  data,  instead  of  being  all  available  simultaneously,  arrive  sequentially  from  a  data 
stream.  ISOMAP  is  representative  of  a  class  of  nonlinear  dimensionality  reduction  algorithms  that 
are  based  on  the  notion  of  a  manifold.  Both  the  standard  ISOMAP  and  the  landmark  version 
of  ISOMAP  are  considered.  Experimental  results  on  synthetic  data  as  well  as  real  world  images 
demonstrate  that  the  modified  algorithm  can  maintain  an  accurate  low-dimensional  representation 
of  the  data  in  an  efficient  manner. 

We  study  the  problem  of  feature  selection  in  model-based  clustering  when  the  number  of  clusters 
is  unknown.  We  propose  the  concept  of  feature  saliency  and  introduce  an  expectation-maximization 
(EM)  algorithm  for  its  estimation.  By  using  the  minimum  message  length  (MML)  model  selection 
criterion,  the  saliency  of  irrelevant  features  is  driven  towards  zero,  which  corresponds  to  performing 
feature  selection.  The  use  of  MML  can  also  determine  the  number  of  clusters  automatically  by 
pruning  away  the  weak  clusters.  The  proposed  algorithm  is  validated  on  both  synthetic  data  and 
data  sets  from  the  UCI  machine  learning  repository. 

We have also developed a new algorithm for incorporating instance-level constraints in
model-based clustering. Its main idea is that we require the cluster label of an object to be determined
only  by  its  feature  vector  and  the  cluster  parameters.  In  particular,  the  constraints  should  not 
have  any  direct  influence.  This  consideration  leads  to  a  new  objective  function  that  considers  both 
the  fit  to  the  data  and  the  satisfaction  of  the  constraints  simultaneously.  The  line-search  Newton 
algorithm  is  used  to  find  the  cluster  parameter  vector  that  optimizes  this  objective  function.  This 
approach  is  extended  to  simultaneously  perform  feature  extraction  and  clustering  under  constraints. 
Comparison  of  the  proposed  algorithm  with  competitive  algorithms  over  eighteen  data  sets  from 
different  domains,  including  text  categorization,  low-level  image  segmentation,  appearance-based 
vision,  and  benchmark  data  sets  from  the  UCI  machine  learning  repository,  shows  the  superiority  of 
the  proposed  approach. 
Chapter 1


Introduction 


The  most  important  characteristic  of  the  information  age  is  the  abundance  of  data.  Advances  in 
computer technology, in particular the Internet, have led to what some people call “data explosion”:
the amount of data available to any person has increased so much that it is more than he or she can
handle. According to a recent study conducted at UC Berkeley, the amount of new data stored on
paper, film, magnetic, and optical media is estimated to have grown 30% per year between 1999 and
2002. In the year 2002 alone, about 5 exabytes of new data were generated. (One exabyte is
about 10^18 bytes, or 1,000,000 terabytes.) Most of the original data are stored in electronic devices
like  hard  disks  (Table  1.1).  This  increase  in  both  the  volume  and  the  variety  of  data  calls  for  advances 
in  methodology  to  understand,  process,  and  summarize  the  data.  From  a  more  technical  point  of 
view,  understanding  the  structure  of  large  data  sets  arising  from  the  data  explosion  is  of  fundamental 
importance  in  data  mining,  pattern  recognition,  and  machine  learning.  In  this  thesis,  we  focus  on 
two  important  techniques  for  data  analysis  in  pattern  recognition:  dimensionality  reduction  and 
clustering.  We  also  investigate  how  the  addition  of  constraints,  an  example  of  side-information,  can 
assist  in  data  clustering. 


1.1  Data  Analysis 

The  word  “data,”  as  simple  as  it  seems,  is  not  easy  to  define  precisely.  We  shall  adopt  a  pattern 
recognition  perspective  and  regard  data  as  the  description  of  a  set  of  objects  or  patterns  that  can  be 
processed  by  a  computer.  The  objects  are  assumed  to  have  some  commonalities,  so  that  the  same 
systematic  procedure  can  be  applied  to  all  the  objects  to  generate  the  description. 

1.1.1  Types  of  Data 

Data  can  be  classified  into  different  types.  Most  often,  an  object  is  represented  by  the  results 
of  measurement  of  its  various  properties.  A  measurement  result  is  called  “a  feature”  in  pattern 
recognition  or  “a  variable”  in  statistics.  The  concatenation  of  all  the  features  of  a  single  object 
forms  the  feature  vector.  By  arranging  the  feature  vectors  of  different  objects  in  different  rows,  we 
get a pattern matrix (also called “data matrix”) of size n by d, where n is the total number of objects
and  d  is  the  number  of  features.  This  representation  is  very  popular  because  it  converts  different 
kinds of objects into a standard representation. If all the features are numerical, an object can be
represented as a point in R^d. This enables a number of mathematical tools to be used to analyze
the objects.

Table 1.1: Worldwide production of original data, if stored digitally, in terabytes (TB) circa 2002.
Upper estimates (denoted by “upper”) assume the data are digitally scanned, while lower estimates
(denoted by “lower”) assume the digital contents have been compressed. It is taken from Table 1.2 in
http://www.sims.berkeley.edu/research/projects/how-much-info-2003/execsum.htm. The
precise definitions of “paper,” “film,” “magnetic,” and “optical” can be found in the web report.
[Table body lost in extraction; surviving entries: 3,212,731; 2,132,238; 74.5%.]

Alternatively, the similarity or dissimilarity between pairs of objects can be used as the data
description. Specifically, a dissimilarity (similarity) matrix of size n by n can be formed for the
n objects, where the (i, j)-th entry of the matrix corresponds to a quantitative assessment of how
dissimilar (similar) the i-th and the j-th objects are. Dissimilarity representation is useful in
applications where domain knowledge suggests a natural comparison function, such as the Hausdorff
distance for geometric shapes. Examples of using dissimilarity for classification can be seen in [132],
and more recently in [202]. A pattern matrix, on the other hand, can be easier to obtain than a
dissimilarity matrix. The system designer can simply list all the interesting attributes of the objects
to obtain the pattern matrix, while a good dissimilarity measure with respect to the task can be
difficult to design.

A similarity/dissimilarity matrix can be regarded as more generic than a pattern matrix, because
given the feature vectors of a set of objects, a dissimilarity matrix of these objects can be generated by
computing the distances among the data points represented by these feature vectors. A similarity
matrix can be generated either by subtracting the distances from a pre-specified number, or by
exponentiating the negative of the distances. A pattern matrix, on the other hand, can be more flexible
because the user can adjust the distance function according to the task. It is easier to incorporate
new information by creating additional features than by modifying the similarity/dissimilarity measure.
Also, in the common scenario where there are a large number of patterns and a moderate number
of features, the size of the pattern matrix, O(nd), is smaller than the size of the similarity/dissimilarity
matrix, O(n^2).
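The conversion from a pattern matrix to a dissimilarity or similarity matrix can be sketched as follows. This is only an illustration: the Euclidean distance, the function names, and the `scale` parameter are assumptions made for the example, not choices prescribed in the text.

```python
import math

def dissimilarity_matrix(patterns):
    """Return the n-by-n matrix of Euclidean distances between the rows
    of an n-by-d pattern matrix (given as a list of lists)."""
    n = len(patterns)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(patterns[i], patterns[j])
            dist[i][j] = dist[j][i] = d  # the matrix is symmetric
    return dist

def similarity_by_subtraction(dist, c):
    """Similarity obtained by subtracting each distance from a constant c."""
    return [[c - d for d in row] for row in dist]

def similarity_by_exponentiation(dist, scale=1.0):
    """Similarity obtained by exponentiating the negative distances.
    The scale parameter is an assumption added for this sketch."""
    return [[math.exp(-d / scale) for d in row] for row in dist]

# Three objects described by d = 2 numerical features.
patterns = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
D = dissimilarity_matrix(patterns)
S_sub = similarity_by_subtraction(D, c=10.0)
S_exp = similarity_by_exponentiation(D)
```

Note that the pattern matrix here stores n·d numbers while each derived matrix stores n·n numbers, which is the storage trade-off mentioned above.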

A  third  possibility  to  represent  an  object  is  by  discrete  structures,  such  as  parse  trees,  ranked 
lists,  or  general  graphs.  Objects  such  as  chemical  structures,  web  pages  with  hyperlinks,  DNA 
sequences,  computer  programs,  or  customer  preference  for  certain  products  have  a  natural  discrete 
structure  representation.  Graph-related  representations  have  also  been  used  in  various  computer 
vision tasks, such as object recognition [145] and shape-from-shading [217]. Representing structural
objects  using  a  vector  of  attributes  can  discard  important  information  on  the  relationship  between 
different  parts  of  the  objects.  On  the  other  hand,  coming  up  with  the  appropriate  dissimilarity 
or  similarity  measure  for  such  objects  is  often  difficult.  New  algorithms  that  can  handle  discrete 
structure directly have been developed. An example is seen in [154], where a kernel function (diffusion
kernel)  is  defined  on  different  vertices  in  a  graph,  leading  to  improved  classification  performance  for 
categorical data. Learning with structural data is sometimes called “learning with relational data,”
and several workshops² have been organized on this theme.

Figure 1.1: Comparing feature vector, dissimilarity matrix, and a discrete structure on a set of
artificial objects. (Left) Extracting different features (color, area, and shape in this case) leads to a
pattern matrix. (Center) A dissimilarity measure on the objects can be used to compare different
pairs of objects, leading to a dissimilarity matrix. (Right) If the user can provide relational properties
on the objects, a discrete structure like a directed graph can be created.

In  Figure  1.1,  we  provide  a  simple  illustration  contrasting  feature  vector,  dissimilarity  matrix, 
and  discrete  structure  representation  for  a  set  of  artificial  objects.  Each  of  the  representations 
corresponds  to  a  different  view  of  the  objects.  In  practice,  the  system  designer  has  to  choose  the 
representation  that  he  or  she  thinks  is  the  most  relevant  to  the  task. 

In this thesis, we focus on feature vector representation, though dissimilarity/similarity
information in the form of instance-level constraints is also considered.

1.1.2  Types  of  Features 

Even  within  the  feature  vector  representation,  descriptions  of  an  object  can  be  classified  into  different 
types.  A  feature  is  essentially  a  measurement,  and  the  “scale  of  measurement”  [244]  proposed  by 
Stevens  can  be  used  to  classify  features  into  different  categories.  They  are: 

Nominal: discrete, unordered. Examples: “apple,” “orange,” and “banana.”

Ordinal: discrete, ordered. Examples: “conservative,” “moderate,” and “liberal.”

Interval: continuous, no absolute zero, can be negative. Examples: temperature in Fahrenheit.

Ratio: continuous, with absolute zero, positive. Examples: length, weight.

² A NIPS workshop in 2002 (http://mlg.anu.edu.au/unrealdata/) and several ICML workshops
(2004: http://www.cs.umd.edu/projects/srl2004/) (2002: http://demo.cs.brandeis.edu/icml02ws/)
(2000: http://www.informatik.uni-freiburg.de/~ml/icml2000_workshop.html) have been held on how to
learn with structural or relational data.

This  classification  scheme,  however,  is  not  perfect  [256].  One  problem  is  that  a  measurement  may 
not  fit  well  into  any  of  the  categories  listed  in  this  scheme.  An  example  for  this  is  given  in  chapter 
5  in  [191],  which  considers  the  following  types  of  measurements: 

Grades: ordered labels such as Freshman, Sophomore, Junior, Senior.

Ranks:  starting  from  1,  which  may  be  the  largest  or  the  smallest. 

Counted  fractions:  bounded  by  zero  and  one.  It  includes  percentage,  for  example. 

Counts:  non-negative  integers. 

Amounts:  non-negative  real  numbers. 

Balances:  unbounded,  positive,  or  negative  values. 

Most people would agree that these six types of data are different, yet all but the third and the last
would be “ordinal” in the scheme by Stevens. “Counted fractions” also do not fit well into any of
the categories proposed by Stevens.

Consideration  of  different  types  of  features  can  help  us  to  design  appropriate  algorithms  for 
handling  different  types  of  data  arising  from  different  domains. 

1.1.3  Types  of  Analysis 

The analysis to be performed on the data can also be classified into different types. It can be
exploratory/descriptive, meaning that the investigator does not have a specific goal and only wants to
understand the general characteristics or structure of the data. It can be confirmatory/inferential,
meaning that the investigator wants to confirm the validity of a hypothesis/model or a set of
assumptions using the available data. Many statistical techniques have been proposed to analyze the
data, such as analysis of variance (ANOVA), linear regression, canonical correlation analysis (CCA),
multidimensional scaling (MDS), factor analysis (FA), or principal component analysis (PCA), to
name a few. A useful overview is given in [245].

In  pattern  recognition,  most  of  the  data  analysis  is  concerned  with  predictive  modeling:  given 
some  existing  data  (“training  data”),  we  want  to  predict  the  behavior  of  the  unseen  data  (“testing 
data”).  This  is  often  called  “machine  learning”  or  simply  “learning.”  Depending  on  the  type  of 
feedback  one  can  get  in  the  learning  process,  three  types  of  learning  techniques  have  been  suggested 
[68]. In supervised learning, labels on data points are available to indicate if the prediction is correct
or  not.  In  unsupervised  learning,  such  label  information  is  missing.  In  reinforcement  learning,  only 
the  feedback  after  a  sequence  of  actions  that  can  change  the  possibly  unknown  state  of  the  system 
is  given.  In  the  past  few  years,  a  hybrid  learning  scenario  between  supervised  and  unsupervised 
learning,  known  as  semi-supervised  learning,  transductive  learning  [136],  or  learning  with  unlabeled 
data  [195],  has  emerged,  where  only  some  of  the  data  points  have  labels.  This  scenario  happens 
frequently  in  applications,  since  data  collection  and  feature  extraction  can  often  be  automated, 
whereas  the  labeling  of  patterns  or  objects  has  to  be  done  manually  and  this  is  expensive  both 
in  time  and  cost.  In  Chapter  5  we  shall  consider  another  hybrid  scenario  where  instance-level 
constraints,  which  can  be  viewed  as  a  “relaxed”  version  of  labels,  are  available  on  some  of  the  data 
points. 
Figure 1.2: An example of dimensionality reduction. The face images are converted into a high
dimensional  feature  vector  by  concatenating  the  pixels.  Dimensionality  reduction  is  then  used  to 
create  a  set  of  more  manageable  low-dimensional  feature  vectors,  which  can  then  be  used  as  the 
input  to  various  classifiers. 


1.2  Dimensionality  Reduction 

Dimensionality  reduction  deals  with  the  transformation  of  a  high  dimensional  data  set  into  a  low 
dimensional  space,  while  retaining  most  of  the  useful  structure  in  the  original  data.  An  example 
application  of  dimensionality  reduction  with  face  images  can  be  seen  in  Figure  1.2.  Dimensionality 
reduction  has  become  increasingly  important  due  to  the  emergence  of  many  data  sets  with  a  large 
number  of  features.  The  underlying  assumption  for  dimensionality  reduction  is  that  the  data  points 
do  not  lie  randomly  in  the  high-dimensional  space;  rather,  there  is  a  certain  structure  in  the  locations 
of  the  data  points  that  can  be  exploited,  and  the  useful  information  in  high  dimensional  data  can 
be  summarized  by  a  small  number  of  attributes. 


1.2.1  Prevalence  of  High  Dimensional  Data 

High dimensional data have become prevalent in different applications in pattern recognition,
machine learning, and data mining. The definition of “high dimensional” has also changed from tens
of features to hundreds or even tens of thousands of features [101].

Some recent applications involving high dimensional data sets include: (i) in text categorization,
the representation of a text document or a web page using the popular bag-of-words model can
lead to thousands of features [277, 254], where each feature corresponds to the occurrence of a
keyword or a key-term in the document; (ii) appearance-based computer vision approaches interpret
each pixel as a feature [253, 22]. Images of handwritten digits can be recognized using the pixel
values by neural networks [170] or support vector machines [255]. Even for a small image of
size 64 by 64, such a representation leads to more than 4,000 features; (iii) hyperspectral images³ in
remote sensing lead to high dimensional data sets: each pixel can contain more than 200 spectral
measurements in different wavelengths; (iv) the characteristics of a chemical compound recorded by
a mass spectrometer can be represented by hundreds of features, where each feature corresponds to
the reading in a particular range; (v) microarray technology enables us to measure the expression
levels of thousands of genes simultaneously for different subjects with different treatments [6, 273].
Analyzing microarray data is particularly challenging, because the number of data points (subjects
in this case) is much smaller than the number of features (expression levels in this case).

³ Information on hyperspectral images can be found at
http://backserv.gsfc.nasa.gov/nips2003hyperspectral.html and http://www.eoc.csiro.au/hswww/Overview.htm.

High  dimensional  data  can  also  be  derived  in  applications  where  the  initial  number  of  features  is 
moderate.  In  an  image  processing  task,  the  user  can  apply  different  filters  with  different  parameters 
to  extract  a  large  number  of  features  from  a  localized  window  in  the  image.  The  features  are  then 
summarized  by  applying  a  dimensionality  reduction  algorithm  that  matches  the  task  at  hand.  This 
(relatively)  automatic  procedure  contrasts  with  the  traditional  approach,  where  the  user  hand-crafts 
a  small  number  of  salient  features  manually,  often  with  great  effort.  Creating  a  large  feature  set 
and  then  summarizing  the  features  is  advantageous  when  the  domain  is  highly  variable  and  robust 
features  are  hard  to  obtain,  such  as  the  occupant  classification  problem  in  [78]. 

1.2.2  Advantages  of  Dimensionality  Reduction 

Why  should  we  reduce  the  dimensionality  of  a  data  set?  In  principle,  the  more  information  we  have 
about  each  pattern,  the  better  a  learning  algorithm  is  expected  to  perform.  This  seems  to  suggest 
that  we  should  use  as  many  features  as  possible  for  the  task  at  hand.  However,  this  is  not  the  case 
in  practice.  Many  learning  algorithms  perform  poorly  in  a  high  dimensional  space  given  a  small 
number  of  learning  samples.  Often  some  features  in  the  data  set  are  just  “noise”  and  thus  do  not 
contribute  to  (sometimes  even  degrade)  the  learning  process.  This  difficulty  in  analyzing  data  sets 
with  many  features  and  a  small  number  of  samples  is  known  as  the  curse  of  dimensionality  [211]. 

Dimensionality  reduction  can  circumvent  this  problem  by  reducing  the  number  of  features  in  the 
data  set  before  the  training  process.  This  can  also  reduce  the  computation  time,  and  the  resulting 
classifiers take less space to store. Models with a small number of variables are often easier for domain
experts  to  interpret.  Dimensionality  reduction  is  also  invaluable  as  a  visualization  tool,  where  the 
high  dimensional  data  set  is  transformed  into  two  or  three  dimensions  for  display  purposes.  This 
can  give  the  system  designer  additional  insight  into  the  problem  at  hand. 

The  main  drawback  of  dimensionality  reduction  is  the  possibility  of  information  loss.  When  done 
poorly,  dimensionality  reduction  can  discard  useful  instead  of  irrelevant  information.  No  matter  what 
subsequent  processing  is  to  be  performed,  there  is  no  way  to  recover  this  information  loss. 

1.2.2.1 Alternatives to Dimensionality Reduction

In  the  context  of  predictive  modeling,  (explicit)  dimensionality  reduction  is  not  the  only  approach  to 
handle  high  dimensional  data.  The  naive  Bayes  classifier  has  found  empirical  success  in  classifying 
high dimensional data sets like webpages (the WEB→KB project in [50]). Regularized classifiers
such  as  support  vector  machines  have  achieved  good  accuracy  for  high  dimensional  data  sets  in  the 
domain  of  text  categorization  [135].  Some  learning  algorithms  have  built-in  feature  selection  abilities 
and  thus  (in  theory)  do  not  require  explicit  dimensionality  reduction.  For  example,  boosting  [90]  can 
use  each  feature  as  a  “weak”  classifier  and  construct  an  overall  classifier  by  selecting  the  appropriate 
features  and  combining  them  [261]. 

Despite the apparent robustness of these learning algorithms in high dimensional data sets, it
can still be beneficial to reduce the dimensionality first. Noisy features can degrade the performance
of support vector machines because values of the kernel function (in particular, the RBF kernel, which
depends on inter-point Euclidean distances) become less reliable if many features are irrelevant. It is
beneficial to adjust the kernel to ignore those features [156], effectively performing dimensionality
reduction. Concerns related to efficiency and storage requirement of a classifier also suggest the use
of dimensionality reduction as a preprocessing step.

The important lesson is: dimensionality reduction is useful for most applications, yet the
tolerance for the amount of information discarded should be subject to the judgement of the system
designer. In general, a more conservative dimensionality reduction strategy should be employed if a
classifier that is more robust to high dimensionality (such as support vector machines) is used. The
dimensionality of the data may still be somewhat large, but at least little useful information is lost.
On the other hand, if a more traditional and easier-to-understand classifier (like quadratic
discriminant analysis) is to be used, we should reduce the dimensionality of the data set more aggressively
to a smaller number, so that the classifier can competently handle the data.

1.2.3  Techniques  for  Dimensionality  Reduction 

Dimensionality reduction techniques can be broadly divided into several categories: (i) feature
selection and feature weighting, (ii) feature extraction, and (iii) feature grouping.

1.2.3.1 Feature Selection and Feature Weighting

Feature  selection,  also  known  as  variable  selection  or  subset  selection  in  the  statistics  (particularly 
regression)  literature,  deals  with  the  selection  of  a  subset  of  features  that  is  most  appropriate  for 
the  task  at  hand.  A  feature  is  either  selected  (because  it  is  relevant)  or  discarded  (because  it  is 
irrelevant).  Feature  weighting  [271],  on  the  other  hand,  assigns  weights  (usually  between  zero  and 
one)  to  different  features  to  indicate  the  saliencies  of  the  individual  features.  Most  of  the  literature 
on feature selection/weighting pertains to supervised learning (both classification [122, 151, 26, 101]
and  regression  [186]). 


Filters, Wrappers, and Embedded Algorithms. Feature selection/weighting algorithms can
be  broadly  divided  into  three  categories  [26,  151,  101].  The  filter  approaches  evaluate  the  relevance 
of  each  feature  (subset)  using  the  data  set  alone,  regardless  of  the  subsequent  learning  task.  RELIEF 
[147]  and  its  enhancement  [155]  are  representatives  of  this  class,  where  the  basic  idea  is  to  assign 
feature  weights  based  on  the  consistency  of  the  feature  value  in  the  k  nearest  neighbors  of  every 
data  point.  Wrapper  algorithms,  on  the  other  hand,  invoke  the  learning  algorithm  to  evaluate  the 
quality of each feature (subset). Specifically, a learning algorithm (e.g., a nearest neighbor classifier,
a  decision  tree,  a  naive  Bayes  method)  is  run  using  a  feature  subset  and  the  feature  subset  is  assessed 
by  some  estimate  related  to  the  classification  accuracy.  Often  the  learning  algorithm  is  regarded  as  a 
“black  box”  in  the  sense  that  the  wrapper  algorithm  operates  independent  of  the  internal  mechanism 
of  the  classifier.  An  example  is  [212],  which  used  genetic  search  to  adjust  the  feature  weights  for 
the  best  performance  of  the  k  nearest  neighbor  classifier.  In  the  third  approach  (called  embedded 
in  [101]),  the  learning  algorithm  is  modified  to  have  the  ability  to  perform  feature  selection.  There 
is  no  longer  an  explicit  feature  selection  step;  the  algorithm  automatically  builds  a  classifier  with  a 
small  number  of  features.  LASSO  (least  absolute  shrinkage  and  selection  operator)  [250]  is  a  good 
example in this category. LASSO modifies ordinary least squares by including a constraint on
the L1 norm of the weight coefficients. This has the effect of preferring sparse regression coefficients
(a  formal  statement  for  this  is  proved  in  [65,  64]),  effectively  performing  feature  selection.  Another 
example  is  MARS  (multivariate  adaptive  regression  splines)  [91],  where  choosing  the  variables  used 
in the polynomial splines effectively performs variable selection. Automatic relevance determination in
neural networks [177] is another example, which uses a Bayesian approach to estimate the weights
in the neural network as well as the relevancy parameters that can be interpreted as feature weights.
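As an illustration of the filter idea behind RELIEF described above, the following is a minimal sketch rather than the published algorithm. It is simplified in several ways that are assumptions of this example: it uses a single nearest hit and a single nearest miss (k = 1) instead of k nearest neighbors, visits every data point instead of sampling, and measures distance with a range-normalized L1 metric. The function name and the toy data are hypothetical.

```python
def relief_weights(X, y):
    """Simplified Relief-style feature weighting (k = 1, no sampling).

    A feature gains weight when it differs on the nearest point of another
    class (miss) and loses weight when it differs on the nearest point of
    the same class (hit)."""
    n, d = len(X), len(X[0])
    # Range of each feature, used to put all differences on a common scale.
    spans = []
    for f in range(d):
        column = [row[f] for row in X]
        spans.append((max(column) - min(column)) or 1.0)

    def distance(a, b):
        return sum(abs(a[f] - b[f]) / spans[f] for f in range(d))

    weights = [0.0] * d
    for i in range(n):
        hits = [j for j in range(n) if j != i and y[j] == y[i]]
        misses = [j for j in range(n) if y[j] != y[i]]
        if not hits or not misses:
            continue  # cannot update without both a hit and a miss
        hit = min(hits, key=lambda j: distance(X[i], X[j]))
        miss = min(misses, key=lambda j: distance(X[i], X[j]))
        for f in range(d):
            diff_miss = abs(X[i][f] - X[miss][f]) / spans[f]
            diff_hit = abs(X[i][f] - X[hit][f]) / spans[f]
            weights[f] += (diff_miss - diff_hit) / n
    return weights

# Toy data: feature 0 separates the two classes, feature 1 is noise.
X = [[0.0, 0.3], [0.1, 0.9], [0.2, 0.1], [0.9, 0.8], [1.0, 0.2], [0.8, 0.5]]
y = [0, 0, 0, 1, 1, 1]
weights = relief_weights(X, y)
```

On this toy data the discriminative feature receives a clearly positive weight and the noise feature a negative one. Because the weights are computed from the data alone, without invoking a classifier, this is a filter in the sense used above.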

Filter  approaches  are  generally  faster  because  they  are  classifier-independent  and  only  require 
computation  of  simple  quantities.  They  scale  well  with  the  number  of  features,  and  many  of  them 
can  comfortably  handle  thousands  of  features.  Wrapper  approaches,  on  the  other  hand,  can  be 
superior  in  accuracy  when  compared  with  filters,  which  ignore  the  properties  of  the  learning  task  at 
hand  [151].  They  are,  however,  computationally  more  demanding,  and  do  not  scale  very  well  with 
the number of features. This is because training and evaluating a classifier with many features can
be  slow,  and  the  performance  of  a  traditional  classifier  with  a  large  number  of  features  may  not  be 
reliable  enough  to  estimate  the  utilities  of  individual  features.  To  get  the  best  results  from  filters 
and  wrappers,  the  user  can  apply  a  filter-type  technique  as  preprocessing  to  cut  down  the  feature 
set  to  a  moderate  size,  and  then  use  a  wrapper  algorithm  to  determine  a  small  yet  discriminative 
feature  subset.  Some  state-of-the-art  feature  selection  algorithms  indeed  adopt  this  approach,  as 
observed  in  [102].  “Embedded”  algorithms  are  highly  specialized  and  it  is  difficult  to  compare  them 
in  general  with  filter  and  wrapper  approaches. 

Quality of a Feature Subset. Feature selection/weighting algorithms can also be classified
according to the definition of “relevance” or how the quality of a feature subset is assessed. Five
definitions  of  relevance  are  given  in  [26].  Information-theoretic  methods  are  often  used  to  evaluate 
features,  because  the  mutual  information  between  a  relevant  feature  and  the  class  labels  should  be 
high  [15].  Non-parametric  methods  can  be  used  to  estimate  the  probability  density  function  of  a 
continuous  feature,  which  in  turn  is  used  to  compute  the  mutual  information  [159,  251].  Correlation 
is also used frequently to evaluate features [278, 104]. A feature can be declared irrelevant if it is
conditionally  independent  of  the  class  labels  given  other  features.  The  concept  of  Markov  blanket  is 
used  to  formalize  this  notion  of  irrelevancy  in  [153].  RELIEF  [147,  155]  uses  the  consistency  of  the 
feature  value  in  the  k  nearest  neighbors  of  every  data  point  to  quantify  the  usefulness  of  a  feature. 
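The information-theoretic criterion mentioned above can be made concrete with a small sketch. It assumes that both the feature and the class label are discrete, so that the probabilities can be estimated by counting; for a continuous feature a density estimate would be needed instead, as noted in the text. The function name and the toy data are hypothetical.

```python
import math
from collections import Counter

def mutual_information(feature, labels):
    """Empirical mutual information (in bits) between a discrete feature
    and the class labels, estimated from co-occurrence counts."""
    n = len(feature)
    joint = Counter(zip(feature, labels))
    count_f = Counter(feature)
    count_y = Counter(labels)
    mi = 0.0
    for (f, y), count in joint.items():
        p_joint = count / n
        p_independent = (count_f[f] / n) * (count_y[y] / n)
        mi += p_joint * math.log2(p_joint / p_independent)
    return mi

labels = [0, 0, 1, 1]
relevant = ["a", "a", "b", "b"]      # determines the label completely
irrelevant = ["x", "y", "x", "y"]    # independent of the label

mi_relevant = mutual_information(relevant, labels)
mi_irrelevant = mutual_information(irrelevant, labels)
```

Here the relevant feature carries one bit of information about the label and the irrelevant feature carries none, so ranking features by this quantity would place the relevant one first.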

Optimization Strategy. Given a definition of feature relevancy, a feature selection algorithm can
search  for  the  most  relevant  feature  subset.  Because  of  the  lack  of  monotonicity  (with  respect  to  the 
features)  of  many  feature  relevancy  criteria,  a  combinatorial  search  through  the  space  of  all  possible 
feature  subsets  is  needed.  Usually,  heuristic  (non-exhaustive)  methods  have  to  be  adopted,  because 
the  size  of  this  space  is  exponential  in  the  number  of  features.  In  this  case,  one  generally  loses  any 
guarantee  of  optimality  of  the  selected  feature  subset.  Different  types  of  heuristics,  such  as  sequential 
forward  or  backward  searches,  floating  search,  beam  search,  bi-directional  search,  and  genetic  search 
have  been  suggested  [36,  151,  209,  275].  A  comparison  of  some  of  these  search  heuristics  can  be  found 
in  [211].  In  the  context  of  linear  regression,  sequential  forward  search  is  often  known  as  stepwise 
regression.  Forward  stagewise  regression  is  a  generalization  of  stepwise  regression,  where  a  feature 
is  only  “partially”  selected  by  increasing  the  corresponding  regression  coefficient  by  a  fixed  amount. 
It  is  closely  related  to  LASSO  [250],  and  this  relationship  was  established  via  least  angle  regression 
(LARS),  another  interesting  algorithm  on  its  own,  in  [72]. 
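
To make the greedy nature of sequential forward search concrete, a minimal Python sketch is given below. It assumes that the relevance criterion is supplied as a function `score` mapping a feature subset to a number; the names `sequential_forward_search` and `toy_score`, and the weights used in the toy criterion, are hypothetical and not taken from any of the cited algorithms.

```python
from typing import Callable, FrozenSet, List


def sequential_forward_search(
    n_features: int,
    score: Callable[[FrozenSet[int]], float],
    max_size: int,
) -> List[int]:
    """Greedily add the feature that most improves `score`; stop when nothing helps."""
    selected: List[int] = []
    best = score(frozenset())
    while len(selected) < max_size:
        candidates = [f for f in range(n_features) if f not in selected]
        if not candidates:
            break
        # Evaluate every one-feature extension of the current subset.
        top_score, top_feature = max(
            (score(frozenset(selected + [f])), f) for f in candidates
        )
        if top_score <= best:
            break  # no extension improves the criterion
        selected.append(top_feature)
        best = top_score
    return selected


# Hypothetical criterion: per-feature usefulness minus a penalty on the subset size.
_TOY_WEIGHTS = [0.5, 0.05, 0.3, 0.0]


def toy_score(subset: FrozenSet[int]) -> float:
    return sum(_TOY_WEIGHTS[f] for f in subset) - 0.1 * len(subset)
```

Because only one-feature extensions of the current subset are examined, the number of criterion evaluations grows polynomially with the number of features, at the price of losing any guarantee of optimality.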

Wrapper  algorithms  generally  include  a  heuristic  search,  as  is  the  case  for  filter  algorithms  with 
feature  quality  criteria  dependent  on  the  features  selected  so  far.  Note  that  feature  weighting 
algorithms  do  not  involve  a  heuristic  search  because  the  weights  for  all  features  are  computed 
simultaneously.  However,  the  computation  of  the  weights  may  be  expensive.  Embedded  approaches 
also  do  not  require  any  heuristic  search.  The  optimal  parameter  is  often  estimated  by  optimizing  a 
certain  objective  function.  Depending  on  the  form  of  the  objective  function,  different  optimization 
strategies can be used. In the case of LASSO, for example, a general quadratic programming solver,
a homotopy method [198], a modified version of LARS [72], or the EM algorithm [80] can be used to
estimate  the  parameters. 

1.2.3.2  Feature Extraction

In  feature  extraction,  a  small  set  of  new  features  is  constructed  by  a  general  mapping  from  the  high 
dimensional  data.  The  mapping  often  involves  all  the  available  features.  Many  techniques  for  feature 
extraction  have  been  proposed.  In  this  section,  we  describe  some  of  the  linear  feature  extraction 
methods, i.e., methods in which the extracted features can be written as linear combinations of the original features.
Nonlinear  feature  extraction  techniques  are  more  sophisticated.  In  Chapter  2  we  shall  examine  some 
of  the  recent  nonlinear  feature  extraction  algorithms  in  more  detail.  The  readers  may  also  find  two 
recent  surveys  [284,  34]  useful  in  this  regard. 

Unsupervised  Techniques  “Unsupervised”  here  refers  to  the  fact  that  these  feature  extraction 
techniques  are  based  only  on  the  data  (pattern  matrix),  without  pattern  label  information.  Principal 
component analysis (PCA), also known as the Karhunen-Loeve transform or simply the KL transform, is
arguably the most popular feature extraction method. PCA finds a hyperplane such that, upon
projection to the hyperplane, the data variance is best preserved. The optimal hyperplane is spanned by
the principal components, which are the leading eigenvectors of the sample covariance matrix.
Features extracted by PCA consist of the projections of the data points onto the different principal components.
When  the  features  extracted  by  PCA  are  used  for  linear  regression,  it  is  sometimes  called  “principal 
component  regression”.  Recently,  sparse  variants  of  PCA  have  also  been  proposed  [137,  291,  52], 
where  each  principal  component  only  has  a  small  number  of  non-zero  coefficients. 
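
The following Python sketch illustrates how the first principal component and the corresponding extracted feature can be computed. It is only a sketch under stated assumptions: the leading eigenvector of the sample covariance matrix is approximated by power iteration (a technique substituted here to keep the example free of external libraries, not one prescribed above), and the function name `leading_principal_component` is hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def leading_principal_component(
    data: Sequence[Sequence[float]], n_iter: int = 200
) -> Tuple[List[float], List[float]]:
    """Return the leading eigenvector of the sample covariance matrix and the projections."""
    n, d = len(data), len(data[0])
    mean = [sum(row[c] for row in data) / n for c in range(d)]
    centered = [[row[c] - mean[c] for c in range(d)] for row in data]
    # Sample covariance matrix (d x d).
    cov = [
        [sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
        for a in range(d)
    ]
    # Power iteration; assumes the start vector is not orthogonal to the leading eigenvector.
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(n_iter):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    # Extracted feature: projection of each centered point onto the component.
    scores = [sum(r[c] * v[c] for c in range(d)) for r in centered]
    return v, scores


# Hypothetical toy data lying exactly on the line y = 2x.
toy_points = [(-2.0, -4.0), (-1.0, -2.0), (0.0, 0.0), (1.0, 2.0), (2.0, 4.0)]
```

For the toy data, the recovered direction is proportional to (1, 2), and the single extracted feature preserves all of the data variance.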

Factor  analysis  (FA)  can  also  be  used  for  feature  extraction.  FA  assumes  that  the  observed  high 
dimensional  data  points  are  the  results  of  a  linear  function  (expressed  by  the  factor  loading  matrix) 
on  a  few  unobserved  random  variables,  together  with  uncorrelated  zero-mean  noise.  After  estimating 
the  factor  loading  matrix  and  the  variance  of  the  noise,  the  factor  scores  for  different  patterns  can 
be estimated and serve as a low-dimensional representation of the data.

Supervised  Techniques  Labels  in  classification  and  response  variables  in  regression  can  be  used 
together  with  the  data  to  extract  more  relevant  features.  Linear  discriminant  analysis  (LDA)  finds 
the  projection  direction  such  that  the  ratio  of  between-class  variance  to  within-class  variance  is 
the  largest.  When  there  are  more  than  two  classes,  multiple  discriminant  analysis  (MDA)  finds 
a  sequence  of  projection  directions  that  maximizes  a  similar  criterion.  Features  are  extracted  by 
projecting  the  data  points  to  these  directions. 

Partial  least  squares  (PLS)  can  be  viewed  as  the  regression  counterpart  of  LDA.  Instead  of 
extracting  features  by  retaining  maximum  data  variance  as  in  principal  component  regression,  PLS 
finds  projection  directions  that  can  best  explain  the  response  variable.  Canonical  correlation  analysis 
(CCA)  is  a  closely  related  technique  that  finds  projection  directions  that  maximize  the  correlation 
between  the  response  variables  and  the  features  extracted  by  projection. 

1.2.3.3  Feature Grouping

In  feature  grouping,  new  features  are  constructed  by  combining  several  existing  features.  Feature 
grouping  can  be  useful  in  scenarios  where  it  can  be  more  meaningful  to  combine  features  due  to  the 
characteristics  of  the  domain.  For  example,  in  a  text  categorization  task  different  words  can  have 
similar  meanings  and  combining  them  into  a  single  word  class  is  more  appropriate.  Another  example 
is the use of the power spectrum for classification, where each feature corresponds to the energy in a
certain frequency range. The preset boundaries of the frequency ranges can be sub-optimal, and the
sum  of  features  from  adjacent  frequency  ranges  can  lead  to  a  more  meaningful  feature  by  capturing 
the  energy  in  a  wider  frequency  range.  For  gene  expression  data,  genes  that  are  similar  may  share 
a  common  biological  pathway  and  the  grouping  of  predictive  genes  can  be  of  interest  to  biologists 
[108, 230, 59].

The  most  direct  way  to  perform  feature  grouping  is  to  cluster  the  features  (instead  of  the  objects) 
of a data set. Feature clustering is not new; the SAS/STAT procedure “varclus” for variable clustering
was written before 1990 [225]. It is performed by applying the hierarchical clustering method on
a similarity matrix of different features, which is derived by, say, Pearson’s correlation coefficient.
This  scheme  was  probably  first  proposed  in  [124],  which  also  suggested  summarizing  one  group  of 
features  by  a  single  feature  in  order  to  achieve  dimensionality  reduction.  Recently,  feature  clustering 
has  been  applied  to  boost  the  performance  in  text  categorization.  Techniques  based  on  distribution 
clustering  [4],  mutual  information  [62],  and  information  bottleneck  [238]  have  also  been  proposed. 

Features  can  also  be  clustered  together  with  the  objects.  As  mentioned  in  [201],  this  idea  has  been 
known  under  different  names  in  the  literature,  including  “bi-clustering”  [41,  150],  “co-clustering” 
[63,  61],  “double-clustering”  [73],  “coupled  clustering”  [95],  and  “simultaneous  clustering”  [208]. 
A  bipartite  graph  can  be  used  to  represent  the  relationship  between  objects  and  features,  and  the 
partitioning  of  the  graph  can  be  used  to  cluster  the  objects  and  the  features  simultaneously  [281,  61]. 
Information  bottleneck  can  also  be  used  for  this  task  [237]. 

In  the  context  of  regression,  feature  grouping  can  be  achieved  indirectly  by  favoring  similar 
features  to  have  similar  coefficients.  This  can  be  done  by  combining  ridge  regression  with  LASSO, 
leading  to  the  elastic  net  regression  algorithm  [290]. 

1.3  Data  Clustering 

The  goal  of  (data)  clustering,  also  known  as  cluster  analysis,  is  to  discover  the  “natural”  grouping(s) 
of a set of patterns, points, or objects. Webster defines cluster analysis as “a statistical classification
technique  for  discovering  whether  the  individuals  of  a  population  fall  into  different  groups  by  making 
quantitative  comparisons  of  multiple  characteristics.”  An  example  of  clustering  can  be  seen  in 
Figure  1.3.  The  unlabeled  data  set  in  Figure  1.3(a)  is  assigned  labels  by  a  clustering  procedure  in 
order  to  discover  the  natural  grouping  of  the  three  groups  as  shown  in  Figure  1.3(b). 

Cluster  analysis  is  prevalent  in  any  discipline  that  involves  analysis  of  multivariate  data.  It  is 
difficult  to  exhaustively  list  the  numerous  uses  of  clustering  techniques.  Image  segmentation,  an 
important  problem  in  computer  vision,  can  be  formulated  as  a  clustering  problem  [94,  128,  234]. 
Documents  can  be  clustered  [120]  to  generate  topical  hierarchies  for  information  access  [221]  or 
retrieval  [20].  Clustering  is  also  used  to  perform  market  segmentation  [3,  39]  as  well  as  to  study 
genome  data  [6]  in  biology. 

Clustering, unfortunately, is difficult for most data sets. A non-trivial example of clustering
is shown in Figure 1.4. Unlike the three well-separated, spherical clusters in Figure 1.3, the seven
clusters in Figure 1.4 have diverse shapes: globular, circular, and spiral in this case. The densities and
the sizes of the clusters are also different. The presence of background noise makes the detection of
the clusters even more difficult. This example also illustrates the fundamental difficulty of clustering.
The diversity of “good” clusters in different scenarios makes it virtually impossible for one to provide
a universal definition of “good” clusters. In fact, it has been proved in [149] that it is impossible
for any clustering algorithm to achieve some fairly basic goals simultaneously. Therefore, it is not
surprising that many clustering algorithms have been proposed to address the different needs of
“good clusters” in different scenarios.

Figure 1.3: (a) Original data. (b) Clustering result. The three well-separated clusters can be easily
detected by most clustering algorithms. Images in this thesis/dissertation are presented in color.

In  this  section,  we  attempt  to  provide  a  taxonomy  of  the  major  clustering  techniques,  present  a 
brief  history  of  cluster  analysis,  and  present  the  basic  ideas  of  some  popular  clustering  algorithms 
in  the  pattern  recognition  community. 

1.3.1  A Taxonomy of Clustering

Many  clustering  algorithms  have  been  proposed  in  different  application  scenarios.  Perhaps  the 
most  important  way  to  classify  clustering  algorithms  is  hierarchical  versus  partitional.  Hierarchical 
clustering  creates  a  tree  of  objects,  where  branches  merging  at  the  lower  levels  correspond  to  higher 
similarity.  Partitional  clustering,  on  the  other  hand,  aims  at  creating  a  “flat”  partition  of  the  set  of 
objects  with  each  object  belonging  to  one  and  only  one  group. 

Clustering  algorithms  can  also  be  classified  by  the  type  of  input  data  used  (pattern  matrix 
or  similarity  matrix),  or  by  the  type  of  the  features,  e.g.  numerical,  categorical,  or  special  data 
structures,  such  as  rank  data,  strings,  graphs,  etc.  (See  Section  1.1.1  for  information  on  different 
types  of  data.)  Alternatively,  a  clustering  algorithm  can  be  characterized  by  the  probability  model 
used,  if  any,  or  by  the  core  search  (optimization)  process  used  to  find  the  clusters.  Hierarchical 
clustering  algorithms  can  be  described  by  the  clustering  direction,  either  agglomerative  or  divisive. 

In  Figure  1.5,  we  provide  one  possible  hierarchy  of  partitional  clustering  algorithms  (modified 
from  [131]).  Heuristic-based  techniques  refer  to  clustering  algorithms  that  optimize  a  certain  notion 
of  “good”  clusters.  The  goodness  function  is  constructed  by  the  user  in  a  heuristic  manner.  Model- 
based  clustering  assumes  that  there  are  underlying  (usually  probabilistic)  models  that  govern  the 
clusters.  Density-based  algorithms  attempt  to  estimate  the  data  density  and  utilize  that  to  construct 
the  clusters. 

One may further sub-divide heuristic-based techniques depending on the input type. If a pattern
matrix is used, the algorithm is usually prototype-based, i.e., each cluster is represented by the most
typical “prototype.” The k-means and the k-medoids algorithms [79] are probably the best known
in this category. If a dissimilarity or similarity matrix is used as the input, two sub-categories are
possible: those based on linkage (single-link, average-link, complete-link, and CHAMELEON [142]),
and those inspired by graph theory, such as min-cut [272] and spectral clustering [234, 194].
Model-based algorithms often refer to clustering by using a finite mixture distribution [184], with
each mixture component interpreted as a cluster. Spatial clustering can involve a probabilistic model
of the point process. For density-based methods, the mean-shift algorithm [45] finds the modes of
the data densities by the mean-shift operation, and the cluster label is determined by the “basin
of convergence” in which a point is located. DENCLUE [111] utilizes a kernel (non-parametric) estimate for
the data density to find the clusters.

Figure 1.4: Diversity of clusters. The seven clusters in this data set (denoted by the seven different
colors), though easily identified by a human, are difficult to detect automatically. The clusters are of
different shapes, sizes, and densities. The presence of background noise makes the clustering task
even more difficult.

1.3.2  A  Brief  History  of  Cluster  Analysis 

According to the scholarly journal archive JSTOR, the first appearance of the word “cluster” in the
title  of  a  scholarly  article  was  in  1739  [11]:  “A  Letter  from  John  Bartram,  M.  D.  to  Peter  Collinson, 
F.  R.  S.  concerning  a  Cluster  of  Small  Teeth  Observed  by  Him  at  the  Root  of  Each  Fang  or  Great 
Tooth  in  the  Head  of  a  Rattle-Snake,  upon  Dissecting  It”.  The  word  “cluster”  here,  though,  was 
used  only  in  its  general  sense  to  denote  a  group.  The  phrase  “cluster  analysis”  first  appeared  in 
1954  and  it  was  suggested  as  a  tool  to  understand  anthropological  data  [43].  In  its  early  days, 
cluster  analysis  was  sometimes  referred  to  as  grouping  [48,  85],  and  biologists  called  it  “numerical 
taxonomy”  [242]. 

Early research on hierarchical clustering was mainly done by biologists, because these techniques
helped them to create a hierarchy of different species for analyzing their relationship systematically.
According to [242], single-link clustering [240], complete-link clustering [213], and average-link
clustering [241] first appeared in 1957, 1948, and 1958, respectively. Ward’s method [266] was proposed
in 1963. Partitional clustering, on the other hand, is closely related to data compression and vector
quantization. This link is not surprising because the cluster labels assigned by a partitional clustering
algorithm can be viewed as the compressed version of the data. The most popular partitional
clustering algorithm, k-means, has been proposed several times in the literature: Steinhaus in 1955
[243], Lloyd in 1957 [174], and MacQueen in 1967 [178]. The ISODATA algorithm by Ball and Hall in
1965 [8] can be regarded as an adaptive version of k-means that adjusts the number of clusters. The
k-means algorithm is also attributed to Forgy (for example, in [140] and [99]), though the reference for this [88]
only contains an abstract and it is not clear what exactly Forgy proposed. The historical account of
vector quantization given in [99] also presents the history of some of the partitional clustering
algorithms. In 1971, Zahn proposed a graph-theoretic clustering method [280], which is closely related
to single-link clustering. The EM algorithm, which is the standard algorithm for estimating a finite
mixture model for mixture-based clustering, is attributed to Dempster et al. in 1977 [58]. Interest
in mean-shift clustering was revived in 1995 by Cheng [40], and Comaniciu and Meer further
popularized it in [45]. Hoffman and Buhmann considered the use of deterministic annealing for pairwise
clustering [115], and Fischer and Buhmann modified the connectedness idea in single-link clustering
that led to path-based clustering [84]. The normalized cut algorithm by Shi and Malik [233] in 1997
is often regarded as the first spectral clustering algorithm, though similar ideas were considered by
spectral graph theorists earlier. A summary of the important results in spectral graph theory can
be found in the 1997 book by Chung [42]. The emergence of data mining has led to a new line of
clustering research that emphasizes efficiency when dealing with huge databases. DBSCAN by Ester
et al. [77] for density-based clustering and CLIQUE by Agrawal et al. [2] for subspace clustering
are two well-known algorithms in this community.

Figure 1.5: A taxonomy of clustering algorithms.

The  current  literature  on  cluster  analysis  is  vast,  and  hundreds  of  clustering  algorithms  have 
been  proposed  in  the  literature.  It  will  require  a  tremendous  effort  to  list  and  summarize  all  the 
major  clustering  algorithms.  The  reader  is  encouraged  to  refer  to  a  survey  like  [130]  or  [79]  for  an 
overview  of  different  clustering  algorithms. 


1.3.3  Examining  Some  Clustering  Algorithms 

In this section, we will examine two very important clustering algorithms used in the pattern
recognition community: the k-means algorithm and the EM algorithm. Other clustering algorithms that
are used regularly in pattern recognition include the mean-shift algorithm [45, 44, 40], pairwise
clustering [115, 116], path-based clustering [84, 83], and spectral clustering [234, 139, 269, 194, 258, 42].

Let $\{y_1, \ldots, y_n\}$ be the set of $n$ $d$-dimensional data points to be clustered. The cluster label of
$y_i$ is denoted by $z_i$. The goal of (partitional) clustering is to recover $z_i$, with $z_i \in \{1, \ldots, k\}$, where
$k$ denotes the number of clusters specified by the user. The set of $y_i$ with $z_i = j$ is referred to as
the $j$-th cluster.

1.3.3.1  The k-means algorithm

The k-means algorithm is probably the best known clustering algorithm. In this algorithm, the $j$-th
cluster is represented by the “cluster prototype” $\mu_j$ in $\mathbb{R}^d$. Clustering is done by finding $z_i$ and $\mu_j$
that minimize the following cost function:

\[
J_{k\text{-means}} = \sum_{i=1}^{n} \| y_i - \mu_{z_i} \|^2 = \sum_{i=1}^{n} \sum_{j=1}^{k} I(z_i = j) \, \| y_i - \mu_j \|^2. \tag{1.1}
\]

Here, $I(z_i = j)$ denotes the indicator function, which is one if the condition $z_i = j$ is true, and zero
otherwise. To optimize $J_{k\text{-means}}$, we assume that all $\mu_j$ are specified. The values of $z_i$ that
minimize $J_{k\text{-means}}$ are given by

\[
\hat{z}_i = \arg\min_{j} \| y_i - \mu_j \|^2. \tag{1.2}
\]

On the other hand, if $z_i$ is fixed, the optimal $\mu_j$ can be found by differentiating $J_{k\text{-means}}$ with
respect to $\mu_j$ and setting the derivatives to zero, leading to

\[
\hat{\mu}_j = \frac{\sum_{i=1}^{n} I(z_i = j) \, y_i}{\sum_{i=1}^{n} I(z_i = j)} = \frac{\sum_{i : z_i = j} y_i}{\text{number of } i \text{ with } z_i = j}. \tag{1.3}
\]

Starting from an initial guess on $\mu_j$, the k-means algorithm iterates between Equations (1.2) and
(1.3), which is guaranteed to decrease the k-means objective function until a local minimum is
reached. In this case, $\mu_j$ and $z_i$ remain unchanged after the iteration, and the k-means algorithm
is said to have converged. The resulting $z_i$ and $\mu_j$ constitute the clustering solution. In practice,
one can stop if the change in successive values of $J_{k\text{-means}}$ is less than a threshold.
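
The alternation between Equations (1.2) and (1.3) can be sketched in a few lines of Python. The sketch below is a minimal illustration rather than a reference implementation: the function names `kmeans` and `kmeans_cost` are hypothetical, the initial prototypes are supplied by the caller, and an empty cluster simply keeps its previous prototype (an assumption, since the text does not discuss this case).

```python
from typing import List, Sequence, Tuple

Point = Sequence[float]


def sq_dist(a: Point, b: Point) -> float:
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(
    points: Sequence[Point], init_centers: Sequence[Point], max_iter: int = 100
) -> Tuple[List[int], List[List[float]]]:
    centers = [list(c) for c in init_centers]
    labels: List[int] = []
    for _ in range(max_iter):
        # Equation (1.2): assign each point to its nearest prototype.
        new_labels = [
            min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # labels (and hence prototypes) no longer change: converged
        labels = new_labels
        # Equation (1.3): each prototype becomes the mean of its assigned points.
        for j in range(len(centers)):
            members = [p for p, z in zip(points, labels) if z == j]
            if members:  # assumption: an empty cluster keeps its old prototype
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


def kmeans_cost(points: Sequence[Point], labels: Sequence[int], centers) -> float:
    """Equation (1.1): sum of squared distances to the assigned prototypes."""
    return sum(sq_dist(p, centers[z]) for p, z in zip(points, labels))


# Hypothetical toy data: two well-separated groups of three points each.
toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
toy_init = [(0, 0), (10, 10)]
```

Calling `kmeans(toy_points, toy_init)` separates the two groups after a single pass; with a poor initialization the same code may stop at a worse local minimum.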

The k-means algorithm is easy to understand and is also easy to implement. However, k-means
has problems in discovering clusters that are not spherical in shape. It also encounters some
difficulties when different clusters have a significantly different number of points. k-means also requires a
good initialization to avoid getting trapped in a poor local minimum. In many cases, the user does
not know the number of clusters in advance, which is required by k-means. The problem of
determining the value of k automatically still does not have a very satisfactory solution. Some heuristics
have been described in [125], and a recent paper on this is [106].

Because the k-means algorithm alternates between the two conditions of optimality, it is an
example of alternating optimization. The k-means clustering result can be interpreted as a solution
to vector quantization, with a codebook of size $k$ and a square error loss function. Each $\mu_j$ is a
codeword in this case. The k-means algorithm can also be viewed as a special case of fitting a
Gaussian mixture, with the covariance matrices of all the mixture components fixed to be $\sigma^2 I$, in the
limit as $\sigma$ tends to zero (for the “hard” cluster assignment). The k-medoids algorithm is similar to k-means,
except that $\mu_j$ is restricted to be one of the given patterns $y_i$.

There is also an online version of k-means. When the $i$-th data point $y_i$ is observed, the cluster
center $\mu_j$ that is the nearest to $y_i$ is found. $\mu_j$ is then updated by

\[
\mu_j^{\text{new}} = \mu_j + \alpha (y_i - \mu_j), \tag{1.4}
\]

where $\alpha$ is the learning rate. This learning rule is an example of “winner-take-all” in competitive
learning, because only the cluster that “wins” the data point can learn from it.
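
A minimal Python sketch of the online update in Equation (1.4) is given below. The function name `online_kmeans_update` is hypothetical, and the learning rate is taken to be a constant supplied by the caller, which is an assumption; a decreasing schedule could be used instead.

```python
from typing import List, Sequence


def online_kmeans_update(
    centers: List[List[float]], y: Sequence[float], alpha: float
) -> int:
    """Move the nearest center towards y (Equation (1.4)); return the winner's index."""
    winner = min(
        range(len(centers)),
        key=lambda j: sum((a - b) ** 2 for a, b in zip(y, centers[j])),
    )
    # Only the winning center learns from the data point ("winner-take-all").
    centers[winner] = [m + alpha * (v - m) for m, v in zip(centers[winner], y)]
    return winner
```

For example, with centers `[[0.0, 0.0], [10.0, 10.0]]`, the point `(2.0, 0.0)`, and `alpha = 0.5`, only the first center moves, and it moves halfway towards the point.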


1.3.3.2  Clustering by Fitting Finite Mixture Model

The k-means algorithm is an example of “hard” clustering, where a data point is assigned to only
one cluster. In many cases, it is beneficial to consider “soft” clustering, where a point is assigned to
different clusters with different degrees of certainty. This can be done either by fuzzy clustering or
by mixture-based clustering. We prefer the latter because it has a more rigorous foundation.

In mixture-based clustering, a finite mixture model is fitted to the data. Let $Y$ and $Z$ be the
random variables for a data point and a cluster label, respectively. Each cluster is represented by
the component distribution $p(Y|\theta_j)$, where $\theta_j$ denotes the parameter for the $j$-th cluster. Data
points from the $j$-th cluster are assumed to follow this distribution, i.e., $p(Y|Z = j) = p(Y|\theta_j)$. The
component distribution $p(Y|\theta_j)$ is often assumed to be a Gaussian when $Y$ is continuous, and the
corresponding mixture model is called “a mixture of Gaussians”. If $Y$ is categorical, a multinomial
distribution can be used for $p(Y|\theta_j)$. Let $\alpha_j = P(Z = j)$ be the prior probability for the $j$-th
cluster. The key idea of a mixture model is

\[
p(Y|\Theta) = \sum_{j=1}^{k} p(Y|Z = j) \, P(Z = j) = \sum_{j=1}^{k} \alpha_j \, p(Y|\theta_j), \tag{1.5}
\]

where $\Theta = \{\theta_1, \ldots, \theta_k, \alpha_1, \ldots, \alpha_k\}$ contains all the model parameters. The mixture model can be
understood as a two-stage data generation process. First, the hidden cluster label $Z$ is sampled
from a multinomial distribution with parameters $(\alpha_1, \ldots, \alpha_k)$. The data point $Y$ is then generated
according to the component distribution determined by $Z$, i.e., $Y$ is sampled from $p(Y|\theta_j)$ if $Z = j$.

The degree of membership of $y_i$ to the $j$-th cluster is determined by the posterior probability that
$Z$ equals $j$ given $y_i$, i.e.,

\[
P(Z = j | Y = y_i) = \frac{P(Z = j, Y = y_i)}{P(Y = y_i)} = \frac{\alpha_j \, p(Y = y_i|\theta_j)}{\sum_{j'=1}^{k} \alpha_{j'} \, p(Y = y_i|\theta_{j'})}. \tag{1.6}
\]

If a “hard” clustering is needed, $y_i$ can be assigned to the cluster with the highest posterior
probability $P(Z|Y = y_i)$.
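
As an illustration of Equation (1.6), the Python sketch below computes the posterior membership probabilities of a single point under a mixture of one-dimensional Gaussians and derives a hard assignment from them. The restriction to one dimension and the names `gaussian_pdf`, `posteriors`, and `hard_label` are assumptions made to keep the example short.

```python
import math
from typing import List, Sequence


def gaussian_pdf(y: float, mean: float, var: float) -> float:
    """Density of a one-dimensional Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (y - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def posteriors(
    y: float, alphas: Sequence[float], means: Sequence[float], variances: Sequence[float]
) -> List[float]:
    """Equation (1.6): P(Z = j | Y = y) for every component j."""
    joint = [a * gaussian_pdf(y, m, v) for a, m, v in zip(alphas, means, variances)]
    total = sum(joint)  # P(Y = y); assumed not to underflow to zero
    return [p / total for p in joint]


def hard_label(post: Sequence[float]) -> int:
    """Index of the cluster with the highest posterior probability."""
    return max(range(len(post)), key=lambda j: post[j])
```

With two equally weighted unit-variance components centered at -1 and 1, the point 0 receives membership 0.5 in each cluster, whereas the point 3 belongs almost entirely to the second cluster.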

The parameter $\Theta$ can be determined using the maximum likelihood principle. We seek $\Theta$ that
minimizes the negative log-likelihood:

\[
J_{\text{mixture}} = - \sum_{i=1}^{n} \log \sum_{j=1}^{k} \alpha_j \, p(y_i|\theta_j). \tag{1.7}
\]

For brevity of notation, we write $p(y_i|\theta_j)$ to denote $p(Y = y_i|\theta_j)$.

The EM algorithm can be used to optimize $J_{\text{mixture}}$. It is a powerful technique for parameter
estimation when some of the data are missing. In the context of a finite mixture model, the missing
data are the cluster labels. Starting with an initial guess of the parameters, the EM algorithm
alternates between the “E-step” and the “M-step”. Let $r_{ij} = P(Z = j|Y = y_i, \Theta^{\text{old}})$, where
$\Theta^{\text{old}}$ is the current parameter estimate. In the E-step, we compute the expected complete data
log-likelihood, also known as the Q-function:

\[
Q(\Theta \| \Theta^{\text{old}}) = E\left[ \sum_{i=1}^{n} \log p(y_i, z_i|\Theta) \right] = \sum_{i=1}^{n} \sum_{j=1}^{k} r_{ij} \left( \log \alpha_j + \log p(y_i|\theta_j) \right). \tag{1.8}
\]

Note that the expectation is done with respect to the old parameter value via $r_{ij}$. Computationally,
the E-step requires calculation of $r_{ij}$. In the M-step, $\Theta$ that maximizes $Q(\Theta \| \Theta^{\text{old}})$ is found:

\[
\Theta^{\text{new}} = \arg\max_{\Theta} Q(\Theta \| \Theta^{\text{old}}). \tag{1.9}
\]

The M-step is guaranteed not to increase $J_{\text{mixture}}$. By repeating the E-step and the M-step, the
negative log-likelihood continues to decrease until a local minimum is reached.
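
The E-step and M-step can be made concrete for a mixture of one-dimensional Gaussians, for which the maximizer in the M-step has a well-known closed form (weighted proportions, means, and variances). The Python sketch below is only an illustration under these assumptions; the function names are hypothetical, and no safeguard is included against a component collapsing onto a single point or receiving no weight.

```python
import math
from typing import List, Sequence, Tuple

Params = Tuple[List[float], List[float], List[float]]  # (alphas, means, variances)


def normal_pdf(y: float, mean: float, var: float) -> float:
    return math.exp(-0.5 * (y - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def neg_log_likelihood(data: Sequence[float], alphas, means, variances) -> float:
    """Negative log-likelihood of Equation (1.7) for 1-D Gaussian components."""
    return -sum(
        math.log(sum(a * normal_pdf(y, m, v) for a, m, v in zip(alphas, means, variances)))
        for y in data
    )


def em_step(data: Sequence[float], alphas, means, variances) -> Params:
    n, k = len(data), len(alphas)
    # E-step: responsibilities r_ij under the current parameters.
    resp = []
    for y in data:
        joint = [alphas[j] * normal_pdf(y, means[j], variances[j]) for j in range(k)]
        total = sum(joint)
        resp.append([p / total for p in joint])
    # M-step: closed-form maximizer of the Q-function for Gaussian components.
    new_alphas, new_means, new_vars = [], [], []
    for j in range(k):
        n_j = sum(r[j] for r in resp)  # assumed to stay strictly positive
        mean_j = sum(r[j] * y for r, y in zip(resp, data)) / n_j
        var_j = sum(r[j] * (y - mean_j) ** 2 for r, y in zip(resp, data)) / n_j
        new_alphas.append(n_j / n)
        new_means.append(mean_j)
        new_vars.append(var_j)
    return new_alphas, new_means, new_vars


def fit_em(data: Sequence[float], alphas, means, variances, n_iter: int = 20):
    """Run EM and record the negative log-likelihood after every iteration."""
    history = [neg_log_likelihood(data, alphas, means, variances)]
    for _ in range(n_iter):
        alphas, means, variances = em_step(data, alphas, means, variances)
        history.append(neg_log_likelihood(data, alphas, means, variances))
    return (alphas, means, variances), history


# Hypothetical toy data: two tight groups around -2 and 2.
toy_data = [-2.1, -1.9, -2.0, -1.8, -2.2, 1.9, 2.1, 2.0, 2.2, 1.8]
```

Running `fit_em(toy_data, [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])` yields a recorded negative log-likelihood that never increases from one iteration to the next, up to rounding error.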


Convergence Proofs on the EM algorithm  In this section, we shall state the well-known proof
in the literature that the M-step indeed decreases $J_{\text{mixture}}$, thereby showing that the EM algorithm
does converge to a local minimum of $J_{\text{mixture}}$. We consider the correctness of the EM algorithm in
a more general setting, where $Y$ and $Z$ are redefined to mean “observed data” and “missing data,”
respectively. Note that the data points and the missing labels are examples of observed data and
missing data, respectively.

In this general setting, $Q(\Theta \| \Theta^{\text{old}})$ can be written as

\[
Q(\Theta \| \Theta^{\text{old}}) = \sum_{Z} p(Z|Y, \Theta^{\text{old}}) \log p(Y, Z|\Theta). \tag{1.10}
\]

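Our first proof is based on the concavity of the logarithm function. Because the M-step maximizes
$Q(\Theta)$, $Q(\Theta^{\text{new}}) - Q(\Theta^{\text{old}}) \geq 0$. Observe that

\begin{align*}
Q(\Theta^{\text{new}}) - Q(\Theta^{\text{old}})
&= \sum_{Z} p(Z|Y, \Theta^{\text{old}}) \left( \log p(Y, Z|\Theta^{\text{new}}) - \log p(Y, Z|\Theta^{\text{old}}) \right) \\
&= \log p(Y|\Theta^{\text{new}}) - \log p(Y|\Theta^{\text{old}}) + \sum_{Z} p(Z|Y, \Theta^{\text{old}}) \log \frac{p(Z|Y, \Theta^{\text{new}})}{p(Z|Y, \Theta^{\text{old}})} \\
&\leq \log p(Y|\Theta^{\text{new}}) - \log p(Y|\Theta^{\text{old}}) + \log \sum_{Z} p(Z|Y, \Theta^{\text{old}}) \frac{p(Z|Y, \Theta^{\text{new}})}{p(Z|Y, \Theta^{\text{old}})} \\
&= \log p(Y|\Theta^{\text{new}}) - \log p(Y|\Theta^{\text{old}}).
\end{align*}

The second equality uses $p(Y, Z|\Theta) = p(Y|\Theta) \, p(Z|Y, \Theta)$. The inequality is due to the concavity
of the logarithm, and the fact that $p(Z|Y, \Theta^{\text{old}})$ can be viewed as
“weights” because they are non-negative and $\sum_{Z} p(Z|Y, \Theta^{\text{old}}) = 1$. Since $Q(\Theta^{\text{new}}) - Q(\Theta^{\text{old}}) \geq 0$,
the above implies $\log p(Y|\Theta^{\text{new}}) - \log p(Y|\Theta^{\text{old}}) \geq 0$. So, the update of the parameter from $\Theta^{\text{old}}$ to
$\Theta^{\text{new}}$ indeed improves the log-likelihood of the observed data. When $\Theta^{\text{old}} = \Theta^{\text{new}}$, the inequality
becomes an equality, and we reach a local maximum of $\log p(Y|\Theta)$, i.e., a local minimum of the
negative log-likelihood.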

Note that the above argument holds as long as $Q(\Theta^{\text{new}}) - Q(\Theta^{\text{old}}) \geq 0$. Thus it suffices to
increase (instead of maximize) the expected complete log-likelihood in the M-step. The resulting
algorithm that only increases the expected complete log-likelihood is known as the generalized EM
algorithm.

It is interesting to note a variant of the EM algorithm used in [80] for Bayesian parameter
estimation. The goal is to find $\Theta$ that maximizes $\log p(\Theta|Y)$. Since the missing
data in [80] are continuous, the expectation is performed by integration instead of summation.
The E-step computes $\int p(Z|\Theta^{\text{old}}, Y) \log p(\Theta|Z, Y) \, dZ$, and the M-step solves
$\Theta^{\text{new}} = \arg\max_{\Theta} \int p(Z|\Theta^{\text{old}}, Y) \log p(\Theta|Z, Y) \, dZ$. The correctness of this variant of the EM algorithm
can be seen by the following:

\begin{align*}
& \int p(Z|\Theta^{\text{old}}, Y) \log p(\Theta^{\text{new}}|Z, Y) \, dZ - \int p(Z|\Theta^{\text{old}}, Y) \log p(\Theta^{\text{old}}|Z, Y) \, dZ \\
&= \int p(Z|\Theta^{\text{old}}, Y) \big( \log p(\Theta^{\text{new}}|Y) + \log p(Z|\Theta^{\text{new}}, Y) - \log p(Z|Y) \\
&\qquad\qquad - \log p(\Theta^{\text{old}}|Y) - \log p(Z|\Theta^{\text{old}}, Y) + \log p(Z|Y) \big) \, dZ \\
&= \log p(\Theta^{\text{new}}|Y) - \log p(\Theta^{\text{old}}|Y) + \int p(Z|\Theta^{\text{old}}, Y) \log \frac{p(Z|\Theta^{\text{new}}, Y)}{p(Z|\Theta^{\text{old}}, Y)} \, dZ \\
&\leq \log p(\Theta^{\text{new}}|Y) - \log p(\Theta^{\text{old}}|Y).
\end{align*}

Note that $p(\Theta|Z, Y) = p(\Theta|Y) \, p(Z|\Theta, Y) / p(Z|Y)$.

Our second proof of the EM algorithm is to regard it as a special case of the variational method.
Here, we follow the presentation in [205]. Let $T(Z)$ be an unknown distribution on the
missing data $Z$. Since $p(Y \mid \theta) = p(Y, Z \mid \theta) / p(Z \mid Y, \theta)$, we have

\[
\begin{aligned}
\log p(Y \mid \theta) &= \log p(Y, Z \mid \theta) - \log p(Z \mid Y, \theta) \\
\log p(Y \mid \theta) &= \sum_Z T(Z) \log p(Y, Z \mid \theta) - \sum_Z T(Z) \log p(Z \mid Y, \theta) \\
&= \sum_Z T(Z) \log \frac{p(Y, Z \mid \theta)}{T(Z)} + D_{KL}\bigl(T(Z) \,\|\, p(Z \mid Y, \theta)\bigr).
\end{aligned}
\]

Here, $D_{KL}(T(Z) \,\|\, p(Z \mid Y, \theta))$ is the Kullback-Leibler divergence defined as

\[
D_{KL}\bigl(T(Z) \,\|\, p(Z \mid Y, \theta)\bigr) = \sum_Z T(Z) \log \frac{T(Z)}{p(Z \mid Y, \theta)}.
\]

Note that the divergence is always nonnegative, meaning that $s = \sum_Z T(Z) \log \frac{p(Y, Z \mid \theta)}{T(Z)}$ is a lower
bound of $\log p(Y \mid \theta)$. The variational method maximizes $\log p(Y \mid \theta)$ indirectly by finding $\theta$ and $T(Z)$
that maximize $s$, under a restriction on the form of $T(Z)$. The EM algorithm can be regarded as
a special case of the variational method that does not put any restriction on $T(Z)$. It is easy to show
that in this case $s$ is maximized with respect to $T(Z)$ if $T(Z) = p(Z \mid Y, \theta)$. With this choice of
$T(Z)$, $s$ is no longer merely a lower bound but exactly equals $\log p(Y \mid \theta)$, because the divergence term is
zero. With $T(Z)$ held fixed at the posterior $p(Z \mid Y, \theta)$ evaluated at the current estimate of $\theta$, maximizing $s$
with respect to $\theta$ is the same as maximizing $\sum_Z T(Z) \log p(Y, Z \mid \theta)$, which
is the Q-function.
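The decomposition above can be checked numerically on a small discrete example. The following is a minimal sketch, assuming a hypothetical toy model in which $Z$ takes three values and the vector `joint` holds $p(Y, Z \mid \theta)$ for the observed $Y$; the function names and the numbers are illustrative only.

```python
import numpy as np

def lower_bound(joint, T):
    """s = sum_Z T(Z) log( p(Y, Z | theta) / T(Z) ) for a discrete Z."""
    return float(np.sum(T * (np.log(joint) - np.log(T))))

def kl_divergence(T, posterior):
    """D_KL(T || posterior) = sum_Z T(Z) log( T(Z) / posterior(Z) )."""
    return float(np.sum(T * (np.log(T) - np.log(posterior))))

# Hypothetical toy model: joint[z] = p(Y = y_obs, Z = z | theta).
joint = np.array([0.10, 0.25, 0.05])
log_evidence = float(np.log(joint.sum()))   # log p(Y | theta)
posterior = joint / joint.sum()             # p(Z | Y, theta)

T = np.array([0.5, 0.3, 0.2])               # an arbitrary distribution on Z
s = lower_bound(joint, T)
kl = kl_divergence(T, posterior)
# s + kl reproduces log p(Y | theta); choosing T = posterior makes kl vanish.
```

With `T` set to `posterior`, the bound is tight, which is the choice made by the E-step.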


1.4  Side-Information 

In many pattern recognition problems, the performance of advanced classifiers like support vector
machines and that of simple classifiers like k-nearest neighbors are more or less the same. It is the “quality”
of the input information (in terms of discrimination power), instead of the type of the classifier, that is
the determining factor in the classification accuracy. However, research effort in pattern recognition
and machine learning has focused on devising better classifiers. One is more likely to improve the
performance of practical systems by incorporating additional domain/contextual information than
by improving the classifier. Side-information, i.e., information other than what is contained in
feature vectors and class labels, is relevant here because it provides alternative means for the system
designer to input more prior knowledge into the classification/clustering system, thereby boosting
its performance.

Side-information  arises  because  some  aspects  of  a  pattern  recognition  problem  cannot  be  specified 
via  the  class  labels  and  the  feature  vectors.  It  can  be  viewed  as  a  complement  to  the  given  pattern  or 
proximity  matrix.  Examples  of  side-information  include  alternative  metrics  between  objects,  known 
data  groupings  or  associations,  additional  labels  or  attributes  (such  as  soft  biometric  traits  [123]), 
relevance  of  different  features,  and  ranks  of  the  objects. 

Side-information is particularly valuable to clustering, owing to the inherent arbitrariness in the
notion of a cluster. Given different possibilities to cluster a data set, side-information can help us to
identify the cluster structure that is the most appropriate in the context in which the clustering solution
will be used. A set of constraints, which specify the relationship between different cluster labels,
is probably the most natural type of side-information in clustering. Constraints arise naturally in
many clustering applications. For example, in image segmentation one can have partial grouping
cues for several regions in the image to assist in the overall clustering [279]. When clustering customers
in a market-basket database, multiple records may pertain to the same person. In video
retrieval tasks, different users may provide alternative annotations of images in small subsets of
a large database [110]. Such groupings may be used for semi-supervised clustering of the entire
database. “Orthogonality” to a known or trivial partition of the data set is another type of
side-information for clustering, and this requirement can be incorporated via a variant of information
bottleneck [97].

1.5  Overview 

In the remainder of this thesis, we shall first provide an in-depth survey of some nonlinear
dimensionality reduction methods in Chapter 2. We then present our work on how to convert ISOMAP,
one of the algorithms described in Chapter 2, to its incremental version in Chapter 3. In
Chapter 4, we present our algorithm on the problem of estimating the relevance of different features in a
clustering context. Chapter 5 describes our proposed approach to perform model-based clustering
in the presence of constraints. Finally, we conclude with some of our contributions to the field and
outline some research directions in Chapter 6.

Chapter 2

A Survey of Nonlinear Dimensionality Reduction Algorithms


In  section  1.2  we  described  the  importance  of  dimensionality  reduction  and  presented  an  overall 
picture  of  different  approaches  for  dimensionality  reduction.  This  chapter  continues  the  discussion 
in section 1.2.3.2, where linear feature extraction methods like principal component analysis (PCA)
and  linear  discriminant  analysis  (LDA)  were  mentioned.  Linear  methods  are  easy  to  understand 
and  are  very  simple  to  implement,  but  the  linearity  assumption  does  not  hold  in  many  real  world 
scenarios.  Images  of  handwritten  digits  do  not  conform  to  the  linearity  assumption  [113];  rotation, 
shearing,  and  variation  of  stroke  widths  can  at  best  be  approximated  by  linear  functions  only  in  a 
small  neighborhood  (as  in  the  use  of  tangent  distance  [68]).  A  transformation  as  simple  as  translating 
an  object  on  a  uniform  background  cannot  be  represented  as  a  linear  function  of  the  pixels.  This 
has  motivated  the  design  of  nonlinear  mapping  methods  in  a  general  setting.  Note,  however,  that 
a  globally  nonlinear  mapping  can  often  be  approximated  by  a  linear  mapping  in  a  local  region.  In 
fact,  this  is  the  essence  of  many  of  the  algorithms  considered  in  this  chapter. 

In  this  chapter,  we  shall  survey  some  of  the  recent  nonlinear  dimensionality  reduction  algorithms, 
with  an  emphasis  on  several  algorithms  that  perform  nonlinear  mapping  via  the  notion  of  learning 
the data manifold. Since we are mostly interested in unsupervised learning, supervised nonlinear
dimensionality reduction methods such as hierarchical discriminant regression (HDR) [118] are omitted from
this survey. Some of the methods considered in this chapter have also been surveyed recently in
[284] and [34].

2.1  Overview 

The  history  of  nonlinear  mapping  is  long,  tracing  back  to  Sammon’s  mapping  in  1969  [223].  Over 
time,  different  techniques  have  been  proposed,  such  as  projection  pursuit  [93]  and  projection  pursuit 
regression  [92],  self  organizing  maps  (SOM)  [152],  principal  curve  and  its  extensions  [107,  249,  239, 
144],  auto-encoder  neural  networks  [7,  57],  generative  topographic  maps  (GTM)  [24],  and  kernel 
principal  component  analysis  [228].  A  comparison  of  some  of  these  methods  can  be  found  in  [180]. 

A new line of nonlinear mapping algorithms has been proposed recently based on the notion of
manifold learning. Given a data set that is assumed to be lying approximately on a (Riemannian)
manifold in a high dimensional space, dimensionality reduction can be achieved by constructing
a mapping that respects certain properties of the manifold. Isometric feature mapping (ISOMAP)
[248], locally linear embedding (LLE), Laplacian eigenmap [16], semidefinite embedding [268],
charting [29], and co-ordination-based ideas [220, 257] are some of the examples. The utility of manifold
learning has been demonstrated in different applications, such as face pose detection [103, 172], face
recognition [283, 276], analysis of facial expressions [75, 38], human motion data interpretation [133],
gait analysis [75, 74], visualization of fiber traces [32], and wood texture analysis [196].

Figure 2.1: An example of a manifold. This example is usually known as the “Swiss roll”. (a)
Surface of the manifold. (b) Data points lying on the manifold.

In this chapter, we shall review some of these algorithms, with an emphasis on the manifold-based
nonlinear mapping algorithms. It is hoped that this exposition can help the reader to become
familiar with these recent exciting developments in nonlinear dimensionality reduction. Table 2.1
provides a comparison of the algorithms we are going to discuss. We want to point out that there
are many other interesting manifold-related ideas that have been omitted in this chapter. Examples
include stochastic embedding [112], locality preserving projections [109], Hessian eigenmap [67],
semidefinite embedding [268] and its extension [267], the co-ordination type methods described in
[257], [134] and [285], as well as the method in [31] which is related to Laplacian eigenmap. Robust
statistics techniques can be used too [214]. It is also possible to learn a Parzen window along the
data manifold [260].

The  rest  of  this  chapter  is  organized  as  follows.  We  first  define  our  notation  and  describe  some 
properties  of  a  manifold  in  Section  2.2.  Sammon’s  mapping,  probably  the  earliest  nonlinear  mapping 
algorithm, is discussed in Section 2.3. Auto-associative neural network [7], also known as auto-encoder
neural networks [57], is described in Section 2.4. Kernel PCA is described in Section 2.5,
followed  by  ISOMAP  in  Section  2.6,  LLE  in  Section  2.7,  and  Laplacian  eigenmap  in  Section  2.8. 
Three  closely  related  ideas  that  involve  combining  different  local  co-ordinates  are  described  in  Section 
2.9.  We  then  show  some  results  of  running  these  algorithms  on  simple  data  sets  in  Section  2.10. 
Finally,  we  summarize  our  survey  in  Section  2.11. 


2.2  Preliminary 

Let $\mathcal{Y} = \{\mathbf{y}_1, \ldots, \mathbf{y}_n\}$ be the high-dimensional data set, where $\mathbf{y}_i \in \mathbb{R}^D$ and $D$ is usually large.
Let $\mathbf{Y} = [\mathbf{y}_1, \ldots, \mathbf{y}_n]$ be the $D \times n$ data matrix. We seek a transformation of $\mathcal{Y}$ that maps $\mathbf{y}_i$ to
its low dimensional counterpart $\mathbf{x}_i$, where $\mathbf{x}_i \in \mathbb{R}^d$ and $d$ is small. Let $\mathbf{X} = [\mathbf{x}_1, \ldots, \mathbf{x}_n]$ be the
$d \times n$ matrix. We shall assume that different $\mathbf{y}_i$ do not lie randomly in $\mathbb{R}^D$, but approximately on
a manifold, which is denoted by $\mathcal{M}$. The manifold may simply be a hyperplane, or it can be more
complicated. An example of a “curved” manifold with the data points lying on it can be seen in
Figure 2.1. This manifold assumption is reasonable because many real world phenomena are driven
by a small number of latent factors. The high dimensional feature vectors observed are the results of
applying a (usually unknown) mapping to the latent factors, followed by the introduction of noise.
Consequently, high dimensional vectors in practice lie approximately on a low dimensional manifold.

| Algorithm | Key idea | Key computation | Iterative | Parameters | Manifold |
|---|---|---|---|---|---|
| Sammon [223] | Minimize Sammon’s stress | Gradient descent | Yes | None | No |
| Auto-associative neural networks [7, 57] | Neural network for feature extraction and data reconstruction | Neural network training | Yes | None | No |
| KPCA [228] | PCA in feature space | Eigenvectors of a large, full matrix | No | Kernel function | No |
| ISOMAP [248] | Preserve geodesic distances | All-pairs shortest path; eigenvectors of a large, full matrix | No | Neighborhood | Yes |
| LLE [226] | Same reconstruction weights | Eigenvectors of a large, sparse matrix | No | Neighborhood | Yes |
| Laplacian eigenmap [16] | Smooth graph embedding | Eigenvectors of a large, sparse matrix | No | Neighborhood; width parameter | Yes |
| Global coordination [220] | Mixture of factor analyzers; unimodal posterior of global coordinate | EM algorithm derived by variational principle | Yes | Number of local models | Yes |
| Charting [29] | Nearby Gaussians are similar; global coordinate by least squares | Constrained linear equations; eigenvectors of a small, full matrix | No | Neighborhood (see caption) | Yes |
| LLC [247] | Local models are given; global coordinate by LLE criterion | Generalized eigenvectors of a small, full matrix | No | A mixture model fitted to the data | Yes |

Table 2.1: A comparison of nonlinear mapping algorithms reviewed in this chapter. The dimensionality of the low dimensional space is assumed to be
known, though some of these algorithms can also estimate it. “Neighborhood” refers to $N(\mathbf{y}_i)$ defined in section 2.2. If an algorithm is inspired by
the notion of a manifold, we put an entry of “yes” in the “Manifold” column. Note that the neighborhood in charting [29] can be estimated from the
data instead of having to be specified by the user. A matrix is “large” if its size is $n$ by $n$, where $n$ is the size of the data set. Only the few leading
or trailing eigenvectors are needed if eigen-decomposition is performed.

Strictly speaking, what we refer to as “manifold” in this thesis should properly be called “Riemannian
manifold.” A Riemannian manifold is smooth and differentiable, and contains the notion
of length. We leave the precise definition of Riemannian manifold to encyclopedias like MathWorld$^1$
and Wikipedia$^2$, and describe only some of its properties here. Every $\mathbf{y}$ in the manifold $\mathcal{M}$ has a
neighborhood $N(\mathbf{y})$ that is homeomorphic$^3$ to a set $S$, where $S$ is either an open subset of $\mathbb{R}^d$, or
an open subset on the closed half of $\mathbb{R}^d$.

This mapping $\phi_{\mathbf{y}} : N(\mathbf{y}) \mapsto S$ is called a co-ordinate chart, and $\phi_{\mathbf{y}}(\mathbf{y})$ is called the “co-ordinate”
of $\mathbf{y}$. A collection of co-ordinate charts that covers the entire $\mathcal{M}$ is called an atlas. If $\mathbf{y}$ is in two
co-ordinate charts $\phi_{\mathbf{y}_1}$ and $\phi_{\mathbf{y}_2}$, $\mathbf{y}$ will have two (local) co-ordinates $\phi_{\mathbf{y}_1}(\mathbf{y})$ and $\phi_{\mathbf{y}_2}(\mathbf{y})$. These
two co-ordinates should be “consistent” in the sense that there is a map to convert between $\phi_{\mathbf{y}_1}(\mathbf{y})$
and $\phi_{\mathbf{y}_2}(\mathbf{y})$, and the map is continuous for any path in $N(\mathbf{y}_1) \cap N(\mathbf{y}_2)$. For any $\mathbf{y}_i$ and $\mathbf{y}_j$ in $\mathcal{M}$,
there can be many paths in $\mathcal{M}$ that connect $\mathbf{y}_i$ and $\mathbf{y}_j$. The shortest of such paths is called the
geodesic$^4$ between $\mathbf{y}_i$ and $\mathbf{y}_j$. For example, the geodesic between two points on a sphere is an arc
of a “great circle”: a circle whose center coincides with the center of the sphere (Figure 2.2). The
length of the geodesic between $\mathbf{y}_i$ and $\mathbf{y}_j$ is the geodesic distance between $\mathbf{y}_i$ and $\mathbf{y}_j$.

To perform nonlinear mapping, one can assume that there exists a mapping $\phi_{\mathrm{global}}(\cdot)$ that maps
all points on $\mathcal{M}$ to $\mathbb{R}^d$. The “global co-ordinate” of $\mathbf{y}$, denoted by $\mathbf{x} = \phi_{\mathrm{global}}(\mathbf{y})$, is regarded as
the low dimensional representation of $\mathbf{y}$. In general, such a mapping may not exist$^5$. In that case,
a mapping that preserves a certain property of the manifold can be constructed to obtain $\mathbf{x}$.

Many of the nonlinear mapping algorithms that are manifold-based require a concrete definition of
$N(\mathbf{y}_i)$, the neighborhood of $\mathbf{y}_i$. Two definitions are commonly used. In the $\epsilon$-neighborhood, $\mathbf{y}_j \in N(\mathbf{y}_i)$
if $\|\mathbf{y}_i - \mathbf{y}_j\| < \epsilon$, where the norm is the Euclidean distance in $\mathbb{R}^D$. In the $k$nn-neighborhood, $\mathbf{y}_j \in N(\mathbf{y}_i)$
if $\mathbf{y}_j$ is one of the $k$ nearest neighbors of $\mathbf{y}_i$ in $\mathcal{Y}$, or vice versa. In both cases, $\epsilon$ or $k$ is a user-defined
parameter. The $k$nn-neighborhood has the advantage that it is independent of the scale of the data,
though it can lead to too small a neighborhood when the number of data points is large. Note that
the neighborhood can be defined in a data-driven manner [29] instead of being specified by a user.
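The two neighborhood definitions can be sketched in a few lines of Python with NumPy. This is only an illustration under stated assumptions: the function names are hypothetical, data points are stored as rows (the transpose of the $D \times n$ matrix $\mathbf{Y}$), a point is not counted as its own neighbor, and all pairwise distances are computed by brute force, which is only practical for small $n$.

```python
import numpy as np

def pairwise_dist(Y):
    """Euclidean distance matrix for the rows of Y (n x D)."""
    diff = Y[:, None, :] - Y[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def epsilon_neighborhood(Y, eps):
    """Boolean n x n matrix; entry (i, j) is True if ||y_i - y_j|| < eps and i != j."""
    A = pairwise_dist(Y) < eps
    np.fill_diagonal(A, False)
    return A

def knn_neighborhood(Y, k):
    """Boolean n x n matrix; entry (i, j) is True if y_j is among the k nearest
    neighbors of y_i, or vice versa (the relation is symmetrized)."""
    dist = pairwise_dist(Y)
    np.fill_diagonal(dist, np.inf)              # exclude each point from its own list
    nearest = np.argsort(dist, axis=1)[:, :k]   # indices of the k closest points
    A = np.zeros(dist.shape, dtype=bool)
    A[np.repeat(np.arange(len(Y)), k), nearest.ravel()] = True
    return A | A.T
```

Rescaling the data changes the result of `epsilon_neighborhood` unless `eps` is rescaled too, whereas `knn_neighborhood` is unaffected, which is the scale independence noted above.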

$^1$ http://mathworld.wolfram.com

$^2$ http://en2.wikipedia.org/

$^3$ Two (topological) spaces are homeomorphic if there exists a continuous and invertible function between the two
spaces, and that the inverse function is also continuous.

$^4$ Strictly speaking, geodesics are curves with zero covariant derivatives of their velocity vectors along the curve. A
shortest curve must be a geodesic, whereas a geodesic might not be a shortest curve.

$^5$ For example, there is no such map (homeomorphism) between all points on a sphere and $\mathbb{R}^2$. However, if we
exclude the north pole of a sphere, we can construct such a mapping.

Figure 2.2: An example of a geodesic. For two points A and B on the sphere, many lines (the
dash-dot lines) can be drawn to connect them. However, the shortest of these lines, which is the
solid line joining A and B, is called the geodesic between A and B. In the case of a sphere, the
geodesic is simply an arc of the great circle.


2.3 Sammon’s mapping

Sammon’s mapping [223], which is an example of metric least square scaling [49], is perhaps the most
well-known algorithm for nonlinear mapping. Sammon’s mapping is an algorithm for multidimensional
scaling and it maps a set of $n$ items into a Euclidean space based on the dissimilarity values.
This problem is related to the metric embedding problem considered by theoretical computer
scientists [119]. Sammon’s mapping can be used for dimensionality reduction if the dissimilarity matrix
is based on the Euclidean distance between the data points in the high dimensional space.


Given an $n$ by $n$ matrix of dissimilarity values $\{\delta_{ij}\}$, where $\delta_{ij}$ denotes the dissimilarity between
the $i$-th and the $j$-th items, we want to map the $n$ items to $n$ points $\{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$ in a low dimensional
space, such that the distance between $\mathbf{x}_i$ and $\mathbf{x}_j$ is as “close” to $\delta_{ij}$ as possible. Many different
definitions of closeness have been proposed, with the “Sammon’s stress”, defined by Sammon, being
the most popular. The Sammon’s stress $S$ is defined by

\[
S = \sum_{i<j} \frac{(d_{ij} - \delta_{ij})^2}{\delta_{ij}} \Big/ \sum_{i<j} \delta_{ij}, \tag{2.1}
\]

where $d_{ij} = \|\mathbf{x}_i - \mathbf{x}_j\|$ is the distance between $\mathbf{x}_i$ and $\mathbf{x}_j$. The quantity $(d_{ij} - \delta_{ij})^2$ measures the
discrepancy between the observed dissimilarity and the actual distance. It is weighted by $\delta_{ij}^{-1}$
because if the dissimilarity is large, we should be more tolerant to the discrepancy. The division
by $\sum_{i<j} \delta_{ij}$ makes $S$ scale free. Sammon proposed the following iterative equation to find $\mathbf{x}_i$ that
minimize $S$:

\[
x_{ik}^{\mathrm{new}} = x_{ik} - MF \cdot \frac{\partial S / \partial x_{ik}}{\left| \partial^2 S / \partial x_{ik}^2 \right|}, \tag{2.2}
\]


where $MF$ is a “magic factor”, usually set to 0.3 or 0.4. Now, differentiating $d_{ij}^2 = \sum_k (x_{ik} - x_{jk})^2$
with respect to $x_{ik}$, we get

\[
2 d_{ij} \frac{\partial d_{ij}}{\partial x_{ik}} = 2 (x_{ik} - x_{jk}), \qquad
\frac{\partial d_{ij}}{\partial x_{ik}} = \frac{x_{ik} - x_{jk}}{d_{ij}}.
\]
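So, the gradient of $S$ is

\[
\frac{\partial S}{\partial x_{ik}} = \frac{2}{\sum_{a<b} \delta_{ab}} \sum_{j \ne i} \frac{d_{ij} - \delta_{ij}}{\delta_{ij} d_{ij}} (x_{ik} - x_{jk}), \tag{2.3}
\]

where $x_{ik}$ is the $k$-th component in $\mathbf{x}_i$. For the second order information, note that

\[
\frac{\partial}{\partial x_{il}} \left( \frac{x_{ik} - x_{jk}}{d_{ij}} \right) = \frac{I(k = l)}{d_{ij}} - \frac{(x_{ik} - x_{jk})(x_{il} - x_{jl})}{d_{ij}^3},
\]

where $I(\cdot)$ is the indicator function defined as

\[
I(\text{true}) = 1, \qquad I(\text{false}) = 0.
\]

Therefore,

\[
\frac{\partial^2 S}{\partial x_{ik} \partial x_{il}} = \frac{2}{\sum_{a<b} \delta_{ab}} \sum_{j \ne i} \left( \frac{d_{ij} - \delta_{ij}}{\delta_{ij} d_{ij}} I(k = l) + \frac{(x_{ik} - x_{jk})(x_{il} - x_{jl})}{d_{ij}^3} \right), \tag{2.4}
\]

and Equation (2.2) uses the case $l = k$.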

One  can  use  a  nonlinear  optimization  algorithm  other  than  Equation  (2.2)  to  minimize  S.  It  is 
also  possible  to  implement  Sammon’s  mapping  by  a  feed-forward  neural  network  [180]  or  in  an 
incremental  manner  [129].  Note  that  Sammon’s  mapping  is  “global”  and  considers  all  the  interpoint 
distances  between  the  n  items.  This  can  be  a  drawback  for  data  like  the  Swiss  roll  data  set,  where 
Euclidean  distances  between  pairs  of  points  that  are  far  away  from  each  other  do  not  reveal  the  true 
structure  of  the  data. 
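To make the iteration concrete, the following is a minimal NumPy sketch of Sammon's mapping that applies the update in Equation (2.2) to all coordinates at once, using derivatives obtained by differentiating Equation (2.1) directly. Several choices are assumptions made for illustration and are not prescribed by the text: the function names, storing points as rows, the small constant `eps` that guards against division by zero, and returning the best configuration seen rather than the last one. The dissimilarities between distinct items are assumed to be strictly positive.

```python
import numpy as np

def pairwise_distances(X):
    """Euclidean distance matrix for the rows of X (n x d)."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def sammon_stress(X, delta):
    """Sammon's stress, Equation (2.1), for configuration X and dissimilarities delta."""
    iu = np.triu_indices(len(X), k=1)
    d = pairwise_distances(X)[iu]
    return float((((d - delta[iu]) ** 2) / delta[iu]).sum() / delta[iu].sum())

def sammon_map(delta, X0, n_iter=100, mf=0.3, eps=1e-12):
    """Iterate Equation (2.2) from the initial configuration X0 (n x d).

    Returns the configuration with the smallest stress encountered and that stress.
    """
    X = X0.astype(float).copy()
    n = len(X)
    off = ~np.eye(n, dtype=bool)                      # mask of pairs with i != j
    c = delta[np.triu_indices(n, k=1)].sum()          # normalizer sum_{i<j} delta_ij
    delta_safe = np.where(off, delta, 1.0)            # dummy value on the diagonal
    best_X, best_S = X.copy(), sammon_stress(X, delta)
    for _ in range(n_iter):
        d_safe = np.where(off, np.maximum(pairwise_distances(X), eps), 1.0)
        diff = X[:, None, :] - X[None, :, :]          # diff[i, j, k] = x_ik - x_jk
        w = np.where(off, (d_safe - delta_safe) / (delta_safe * d_safe), 0.0)
        inv_d3 = np.where(off, 1.0 / d_safe ** 3, 0.0)
        grad = (2.0 / c) * (w[:, :, None] * diff).sum(axis=1)
        curv = (2.0 / c) * (w[:, :, None] + inv_d3[:, :, None] * diff ** 2).sum(axis=1)
        X = X - mf * grad / np.maximum(np.abs(curv), eps)
        S = sammon_stress(X, delta)
        if S < best_S:                                # keep the best iterate
            best_X, best_S = X.copy(), S
    return best_X, best_S
```

A typical use is to set `delta` to the Euclidean distances between the high dimensional points and to initialize `X0` with a linear projection such as the first few principal components.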


2.4  Auto-associative  neural  network 

A special type of feed-forward neural network, “auto-associative neural network” [7, 57], can be
used for nonlinear dimensionality reduction. An example of such a network is shown in Figure 2.3.
The idea is to model the functional relationship between $\mathbf{x}_i$ and $\mathbf{y}_i$ by a neural network. If $\mathbf{x}_i$ is
a good representation for $\mathbf{y}_i$, it should contain sufficient information to reconstruct $\mathbf{y}_i$ via a neural
network (decoding network), with the “decoding layer” as its hidden layer. To obtain $\mathbf{x}_i$ from $\mathbf{y}_i$,
another neural network (encoding network) is needed, with the “encoding layer” as its hidden layer.
The encoding network and the decoding network are connected so that the output of the encoding
network is used as the input of the decoding network, and both of them correspond to $\mathbf{x}_i$. The
high-dimensional data points $\mathbf{y}_i$ are used as both the input and the target for training in this neural
network. Sum of square error can be used as the objective function for training. Note that the
neural network in Figure 2.3 is just an example; alternative architecture can be used. For example,
multiple hidden layers can be used, and the number of neurons in the encoding and decoding layers
can also be different.

Figure 2.3: Example of an auto-associative neural network. This network extracts $\mathbf{x}_i$ with 3 features
from the given data $\mathbf{y}_i$ with 8 features.

The  advantage  of  this  approach  is  that  mapping  a  new  y  to  the  corresponding  x  is  easy:  just  feed 
y to the neural network and extract the output of the encoding layer. Also, a number
of software packages exist for training neural networks. The drawback is that it is difficult to determine
the  appropriate  network  architecture  to  best  reduce  the  dimension  for  any  given  data  set.  Also, 
training  of  a  neural  network  involves  an  optimization  problem  that  is  considerably  more  difficult 
than  the  eigen-decomposition  required  by  some  other  nonlinear  mapping  methods  like  ISOMAP, 
LLE,  or  Laplacian  eigenmap,  which  we  shall  examine  later  in  this  chapter. 
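As an illustration of the architecture described in this section, the following is a small NumPy sketch of an auto-associative network with one encoding layer, a bottleneck that plays the role of $\mathbf{x}_i$, and one decoding layer, trained by full-batch gradient descent on the sum of square error. The layer sizes, the tanh activation, the learning rate, and all function names are assumptions chosen for the sketch; in practice one of the existing neural network packages would normally be used instead.

```python
import numpy as np

def init_params(D, h, d, rng):
    """Small random weights for a D -> h -> d -> h -> D network."""
    sizes = [(D, h), (h, d), (d, h), (h, D)]
    return [(0.1 * rng.standard_normal(s), np.zeros(s[1])) for s in sizes]

def forward(params, Y):
    """Return the activations of all layers for the rows of Y (n x D)."""
    (W1, b1), (W2, b2), (W3, b3), (W4, b4) = params
    H1 = np.tanh(Y @ W1 + b1)    # encoding layer (hidden)
    X = H1 @ W2 + b2             # bottleneck: the low dimensional x_i
    H2 = np.tanh(X @ W3 + b3)    # decoding layer (hidden)
    R = H2 @ W4 + b4             # reconstruction of y_i
    return H1, X, H2, R

def train(Y, d=3, h=10, lr=0.01, n_iter=500, seed=0):
    """Gradient descent on the mean sum of square reconstruction error."""
    rng = np.random.default_rng(seed)
    n, D = Y.shape
    params = init_params(D, h, d, rng)
    losses = []
    for _ in range(n_iter):
        (W1, b1), (W2, b2), (W3, b3), (W4, b4) = params
        H1, X, H2, R = forward(params, Y)
        losses.append(0.5 * np.sum((R - Y) ** 2) / n)
        # Back-propagation; Y is both the input and the target.
        dR = (R - Y) / n
        dZ3 = (dR @ W4.T) * (1.0 - H2 ** 2)
        dX = dZ3 @ W3.T
        dZ1 = (dX @ W2.T) * (1.0 - H1 ** 2)
        grads = [(Y.T @ dZ1, dZ1.sum(0)), (H1.T @ dX, dX.sum(0)),
                 (X.T @ dZ3, dZ3.sum(0)), (H2.T @ dR, dR.sum(0))]
        params = [(W - lr * gW, b - lr * gb)
                  for (W, b), (gW, gb) in zip(params, grads)]
    return params, losses

def encode(params, Y):
    """Map new points to the low dimensional space by reading the bottleneck."""
    return forward(params, Y)[1]
```

Mapping a new $\mathbf{y}$ is a single call to `encode`, which reflects the advantage mentioned above; the difficulty of choosing `h`, the number of layers, and the training settings reflects the drawback.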


2.5  Kernel  PCA 

The basic idea of kernel principal component analysis (KPCA) is to transform the input patterns
to an even higher dimensional space nonlinearly and then perform principal component analysis in
the new space. It is inspired by the success of the support vector machines (SVM) [189].

2.5.1 Recap of SVM

Consider a mapping $\phi : \mathbb{R}^D \mapsto \mathcal{H}$, where $\mathcal{H}$ is a Hilbert space. $\mathcal{H}$ can be, for example, a (very) high
dimensional Euclidean space. By convention, $\mathbb{R}^D$ and $\mathcal{H}$ are called the input space and the feature
space, respectively. The point $\mathbf{y}_i$ in $\mathbb{R}^D$ is first transformed into the Hilbert space $\mathcal{H}$ by $\phi(\mathbf{y}_i)$. SVM
assumes a suitable transformation $\phi(\cdot)$ such that the transformed data set is more linearly separable
in $\mathcal{H}$ than in $\mathbb{R}^D$, and a large margin classifier in $\mathcal{H}$ is trained to separate the transformed data. It
turns out that the large margin classifier can be trained by using only the inner product between the
transformed data $\langle \phi(\mathbf{y}_i), \phi(\mathbf{y}_j) \rangle$, without knowing $\phi(\cdot)$ explicitly. Therefore, in practice, the kernel
function $K(\mathbf{y}_i, \mathbf{y}_j)$ is specified instead of $\phi(\cdot)$, where

\[
K(\mathbf{y}_i, \mathbf{y}_j) = \langle \phi(\mathbf{y}_i), \phi(\mathbf{y}_j) \rangle.
\]

Specifying the kernel function $K(\cdot, \cdot)$ instead of the mapping $\phi(\cdot)$ has the advantage of computational
efficiency when $\mathcal{H}$ is of high dimension. Also, this allows us to generalize to infinite dimensional $\mathcal{H}$,
which happens when the radial basis function kernel is used. This use of kernel function to replace
an explicit mapping is often called “the kernel trick”. Intuitively, the kernel function, being an inner
product, represents the similarity between $\mathbf{y}_i$ and $\mathbf{y}_j$.

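The kernel trick can be illustrated by the following example with $D = 2$. Let $\phi(\mathbf{y}_i) =
(y_{i1}^2, \sqrt{2}\, y_{i1} y_{i2}, y_{i2}^2)^T$, where $\mathbf{y}_i = (y_{i1}, y_{i2})^T$. The kernel function corresponding to this $\phi(\cdot)$ is
$K(\mathbf{y}_i, \mathbf{y}_j) = (y_{i1} y_{j1} + y_{i2} y_{j2})^2$, because

\[
\begin{aligned}
K(\mathbf{y}_i, \mathbf{y}_j) &= (y_{i1} y_{j1} + y_{i2} y_{j2})^2 \\
&= y_{i1}^2 y_{j1}^2 + 2 y_{i1} y_{j1} y_{i2} y_{j2} + y_{i2}^2 y_{j2}^2 \\
&= (y_{i1}^2, \sqrt{2}\, y_{i1} y_{i2}, y_{i2}^2) (y_{j1}^2, \sqrt{2}\, y_{j1} y_{j2}, y_{j2}^2)^T \\
&= \phi(\mathbf{y}_i)^T \phi(\mathbf{y}_j).
\end{aligned}
\]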
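The identity in this example is easy to confirm numerically. The short sketch below, with hypothetical function names and arbitrarily chosen test vectors, evaluates the inner product once through the explicit three-dimensional map and once through the kernel function alone.

```python
import numpy as np

def phi(y):
    """Explicit feature map for D = 2: (y1^2, sqrt(2) y1 y2, y2^2)."""
    y1, y2 = y
    return np.array([y1 ** 2, np.sqrt(2.0) * y1 * y2, y2 ** 2])

def poly2_kernel(yi, yj):
    """Kernel K(yi, yj) = (yi . yj)^2, computed without forming phi."""
    return float(np.dot(yi, yj)) ** 2

yi = np.array([1.0, 2.0])
yj = np.array([3.0, -1.0])
explicit = float(np.dot(phi(yi), phi(yj)))   # inner product in the feature space
implicit = poly2_kernel(yi, yj)              # same value from the input space
```

Both routes give the same number, but the kernel route never constructs the feature vectors, which is what makes very high dimensional feature spaces tractable.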


Many different kernel functions have been proposed. The polynomial kernel, defined as
$K(\mathbf{y}_i, \mathbf{y}_j) = (\mathbf{y}_i^T \mathbf{y}_j + 1)^r$ with $r$ as the parameter (degree) of the kernel, corresponds to a polynomial decision
boundary in the input space. The radial basis function (RBF) kernel is defined by
$K(\mathbf{y}_i, \mathbf{y}_j) = \exp(-\omega \|\mathbf{y}_i - \mathbf{y}_j\|^2)$, where $\omega$ is the width parameter. SVM classifiers using the RBF kernel are related to
RBF neural networks, except that for SVM, the centers of the basis functions and the corresponding
weights are estimated by the quadratic programming solver simultaneously [229]. The choice of the
appropriate kernel function in an application is difficult in general. This is still an active research
area, with many principles being proposed [121, 154, 227].


2.5.2 Kernel PCA

One important lesson we can learn from SVM is that a linear algorithm in the feature space
corresponds to a nonlinear algorithm in the input space. Different types of nonlinearity can be achieved
by different kernel functions. Kernel PCA [228] utilizes this to generalize PCA to become nonlinear.
For ease of notation, we shall assume $\mathcal{H}$ is of finite dimension$^6$.

$^6$ The case for infinite dimensional $\mathcal{H}$ is similar, with operators replacing matrices and eigenfunctions replacing
eigenvectors.

KPCA follows the steps of the standard PCA, except the data set under consideration is
$\{\phi(\mathbf{y}_1), \ldots, \phi(\mathbf{y}_n)\}$. Let $\tilde{\phi}(\mathbf{y}_i)$ be the “centered” version of $\phi(\mathbf{y}_i)$,

\[
\tilde{\phi}(\mathbf{y}_i) = \phi(\mathbf{y}_i) - \frac{1}{n} \sum_{l=1}^{n} \phi(\mathbf{y}_l).
\]

The covariance matrix $\mathbf{C}$ is given by

\[
\mathbf{C} = \frac{1}{n} \sum_i \tilde{\phi}(\mathbf{y}_i) \tilde{\phi}(\mathbf{y}_i)^T.
\]

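The eigenvalue problem $\lambda \mathbf{v} = \mathbf{C} \mathbf{v}$ is solved to find the (kernel) principal component $\mathbf{v}$. Because

\[
\mathbf{v} = \frac{1}{\lambda} \mathbf{C} \mathbf{v} = \frac{1}{n \lambda} \sum_i \tilde{\phi}(\mathbf{y}_i) \bigl( \tilde{\phi}(\mathbf{y}_i)^T \mathbf{v} \bigr),
\]

$\mathbf{v}$ is in the subspace spanned by $\tilde{\phi}(\mathbf{y}_i)$, and it can be written as

\[
\mathbf{v} = \sum_j \alpha_j \tilde{\phi}(\mathbf{y}_j).
\]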

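Denote $\boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_n)^T$. Let $\tilde{\mathbf{K}}$ be the symmetric matrix such that its $(i, j)$-th entry $\tilde{K}_{ij}$ is
$\tilde{\phi}(\mathbf{y}_i)^T \tilde{\phi}(\mathbf{y}_j)$. Rewrite $\lambda \mathbf{v} = \mathbf{C} \mathbf{v}$ as

\[
\lambda \sum_j \alpha_j \tilde{\phi}(\mathbf{y}_j) = \frac{1}{n} \sum_{i,j} \alpha_j \tilde{K}_{ij} \tilde{\phi}(\mathbf{y}_i). \tag{2.5}
\]

By multiplying both sides with $\tilde{\phi}(\mathbf{y}_l)^T$, we have

\[
\lambda \sum_j \alpha_j \tilde{K}_{lj} = \frac{1}{n} \sum_{i,j} \alpha_j \tilde{K}_{ij} \tilde{K}_{li} \quad \forall l, \tag{2.6}
\]

which, in matrix form, can be written as

\[
\lambda n \tilde{\mathbf{K}} \boldsymbol{\alpha} = \tilde{\mathbf{K}}^2 \boldsymbol{\alpha}. \tag{2.7}
\]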

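Since $\tilde{\mathbf{K}}$ is symmetric, $\tilde{\mathbf{K}}$ and $\tilde{\mathbf{K}}^2$ have the same set of eigenvectors. This set of eigenvectors is also
the solution to the generalized eigenvalue problem in Equation (2.7). Therefore, $\boldsymbol{\alpha}$, and hence $\mathbf{v}$,
can be found by solving $\lambda \boldsymbol{\alpha} = \tilde{\mathbf{K}} \boldsymbol{\alpha}$ (with the factor $n$ absorbed into $\lambda$). For projection purposes, it is customary to normalize $\mathbf{v}$ to norm
one. Since $\|\mathbf{v}\|^2 = \boldsymbol{\alpha}^T \tilde{\mathbf{K}} \boldsymbol{\alpha}$, we should divide $\boldsymbol{\alpha}$ by $\sqrt{\boldsymbol{\alpha}^T \tilde{\mathbf{K}} \boldsymbol{\alpha}}$. To perform dimensionality reduction
for $\mathbf{y}$, it is first mapped to the feature space as $\tilde{\phi}(\mathbf{y})$, and its projection on $\mathbf{v}$ is given by

\[
\tilde{\phi}(\mathbf{y})^T \mathbf{v} = \sum_i \alpha_i \tilde{\phi}(\mathbf{y})^T \tilde{\phi}(\mathbf{y}_i) = \boldsymbol{\alpha}^T \tilde{\mathbf{k}}_{\mathbf{y}}, \tag{2.8}
\]

where $\tilde{\mathbf{k}}_{\mathbf{y}} = \bigl( \tilde{\phi}(\mathbf{y})^T \tilde{\phi}(\mathbf{y}_1), \ldots, \tilde{\phi}(\mathbf{y})^T \tilde{\phi}(\mathbf{y}_n) \bigr)^T$. Finally, by rewriting the relationship

\[
\begin{aligned}
\tilde{K}_{ij} &= \tilde{\phi}(\mathbf{y}_i)^T \tilde{\phi}(\mathbf{y}_j) \\
&= \Bigl( \phi(\mathbf{y}_i) - \frac{1}{n} \sum_{l=1}^{n} \phi(\mathbf{y}_l) \Bigr)^T \Bigl( \phi(\mathbf{y}_j) - \frac{1}{n} \sum_{k=1}^{n} \phi(\mathbf{y}_k) \Bigr) \\
&= \phi(\mathbf{y}_i)^T \phi(\mathbf{y}_j) - \frac{1}{n} \sum_{l=1}^{n} \phi(\mathbf{y}_l)^T \phi(\mathbf{y}_j)
 - \frac{1}{n} \sum_{k=1}^{n} \phi(\mathbf{y}_k)^T \phi(\mathbf{y}_i) + \frac{1}{n^2} \sum_{k=1}^{n} \sum_{l=1}^{n} \phi(\mathbf{y}_k)^T \phi(\mathbf{y}_l)
\end{aligned}
\]

in matrix form, we have

\[
\tilde{\mathbf{K}} = \mathbf{H}_n \mathbf{K} \mathbf{H}_n, \tag{2.9}
\]

where $\mathbf{H}_n = \mathbf{I} - \frac{1}{n} \mathbf{1}_{n,n}$ is a centering matrix with $\mathbf{1}_{n,n}$ denoting a matrix of size $n$ by $n$ with
all entries one, and $\mathbf{K}$ is the kernel matrix with its $(i, j)$-th entry given by $K(\mathbf{y}_i, \mathbf{y}_j)$. A similar
expression can be derived for $\tilde{\phi}(\mathbf{y})^T \tilde{\phi}(\mathbf{y}_j)$.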

KPCA solves the eigenvalue problem of an $n$ by $n$ matrix, which may be larger than the $D$ by $D$
matrix considered by PCA. Recall that $D$ is the dimension of $\mathbf{y}_i$. The number of possible features to be
extracted in KPCA can be larger than $D$. This contrasts with the standard PCA, where at most $D$
features can be extracted. An interesting problem related to KPCA is how to map $\mathbf{z}$, the projection
of $\phi(\mathbf{y})$ into the subspace spanned by the first few kernel principal components, back to the input
space. This can be useful for, say, image denoising with KPCA [185]. The search for the “best” $\mathbf{y}'$
such that $\phi(\mathbf{y}') \approx \mathbf{z}$ is known as the pre-image problem and different solutions have been proposed
[160, 5].

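In summary, KPCA consists of the following steps.

1. Let $\mathbf{K}$ be the kernel matrix, where $K_{ij} = K(\mathbf{y}_i, \mathbf{y}_j)$. Compute $\tilde{\mathbf{K}}$ by
$\tilde{\mathbf{K}} = \mathbf{H}_n \mathbf{K} \mathbf{H}_n$.

2. Solve the eigenvalue problem $\lambda \boldsymbol{\alpha} = \tilde{\mathbf{K}} \boldsymbol{\alpha}$ and find the eigenvectors corresponding to the largest
few eigenvalues.

3. Normalize $\boldsymbol{\alpha}$ by dividing it by $\sqrt{\boldsymbol{\alpha}^T \tilde{\mathbf{K}} \boldsymbol{\alpha}}$.

4. For any $\mathbf{y}$, its projection to a principal component can be found by $\boldsymbol{\alpha}^T \tilde{\mathbf{k}}_{\mathbf{y}}$, where
$\tilde{\mathbf{k}}_{\mathbf{y}} = \mathbf{H}_n (\mathbf{k}_{\mathbf{y}} - \frac{1}{n} \mathbf{K} \mathbf{1}_{n,1})$,
$\mathbf{k}_{\mathbf{y}} = (K(\mathbf{y}, \mathbf{y}_1), \ldots, K(\mathbf{y}, \mathbf{y}_n))^T$ and $\mathbf{1}_{n,1}$ is an $n$ by 1 vector with all entries equal to one.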
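The four steps above translate almost line by line into NumPy. The following is a minimal sketch under stated assumptions rather than a reference implementation: the function names and the dictionary used to hold the fitted quantities are hypothetical, points are stored as rows, the kernel is passed in as a function that returns the matrix of kernel values between two sets of points, and the selected eigenvalues are assumed to be strictly positive.

```python
import numpy as np

def rbf_kernel(A, B, omega=1.0):
    """RBF kernel matrix with entries exp(-omega * ||a_i - b_j||^2)."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-omega * sq)

def kpca_fit(Y, kernel, n_components):
    n = len(Y)
    K = kernel(Y, Y)                              # step 1: kernel matrix
    H = np.eye(n) - np.ones((n, n)) / n           # centering matrix H_n
    K_tilde = H @ K @ H                           # Equation (2.9)
    evals, evecs = np.linalg.eigh(K_tilde)        # step 2: symmetric eigenproblem
    top = np.argsort(evals)[::-1][:n_components]  # largest eigenvalues first
    lam, alpha = evals[top], evecs[:, top]
    # Step 3: for a unit-norm eigenvector, alpha^T K_tilde alpha = lambda.
    alpha = alpha / np.sqrt(lam)
    return {"Y": Y, "K": K, "H": H, "K_tilde": K_tilde,
            "alpha": alpha, "kernel": kernel}

def kpca_transform(model, Y_new):
    """Step 4: project the rows of Y_new onto the kernel principal components."""
    k = model["kernel"](Y_new, model["Y"])        # row i holds k_y^T for y = Y_new[i]
    # k_tilde^T = (k_y - K 1/n)^T H_n, using the symmetry of K and H_n.
    k_tilde = (k - model["K"].mean(axis=0)) @ model["H"]
    return k_tilde @ model["alpha"]
```

With a linear kernel, the projections of the training points coincide with the ordinary PCA scores up to sign, which gives a convenient sanity check of the centering and normalization.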


2.6  ISOMAP 

The basic idea of isometric feature mapping (ISOMAP) [248] is to find a mapping that best preserves the
geodesic distances between any two points on a manifold. Recall that the geodesic distance between
two points on a manifold is defined as the length of the shortest path on the manifold that connects
the two points. ISOMAP constructs a mapping from $\mathbf{y}_i$ to $\mathbf{x}_i$ ($\mathbf{x}_i \in \mathbb{R}^d$) such that the Euclidean
distance between $\mathbf{x}_i$ and $\mathbf{x}_j$ in $\mathbb{R}^d$ is as close as possible to the geodesic distance between $\mathbf{y}_i$ and $\mathbf{y}_j$
on the manifold.

Geodesic distances are hard enough to find when the manifold is known, let alone in the current
case where the manifold is unknown and only points on the manifold are given. So, ISOMAP
approximates the geodesic distances by first constructing a neighborhood graph to represent the
manifold. The vertex $v_i$ in the neighborhood graph $G = (V, E)$ corresponds to the high dimensional
data point $y_i$. An edge $e(i,j)$ between $v_i$ and $v_j$ exists if and only if $y_i$ is in the neighborhood of
$y_j$, $N(y_j)$, and the weight of this edge is $\|y_i - y_j\|$. Details of $N(y_j)$ are described in section 2.2.
An example of a neighborhood graph is shown in Figure 2.4(b) for the data shown in Figure 2.4(a).
ISOMAP approximates a path on the manifold by a path in the neighborhood graph. The geodesic
between $y_i$ and $y_j$ corresponds to the shortest path between $v_i$ and $v_j$. The estimation problem of
the geodesic distances between all pairs of points $y_i$ and $y_j$ thus becomes the all-pairs shortest path
problem in the neighborhood graph. It can be solved [46] either by the Floyd-Warshall algorithm,
or by Dijkstra’s algorithm with different source vertices. The latter is more efficient because the
neighborhood graph is sparse. An example of how the shortest path approximates the geodesic is
shown in Figure 2.4(c). It can be shown that the shortest path distances converge to the geodesic
distances asymptotically [18].
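As an illustration of this approximation, the sketch below builds a symmetric knn graph and runs Dijkstra’s algorithm from every vertex. It is a minimal example under stated assumptions, not the implementation used in the thesis: the function names, the half-circle toy data, and the choice k = 2 are all hypothetical.

```python
import heapq
import math


def knn_graph(points, k):
    """Symmetric k-nearest-neighbor graph; edge weight = Euclidean distance."""
    n = len(points)
    graph = {i: {} for i in range(n)}
    for i in range(n):
        dists = sorted((math.dist(points[i], points[j]), j)
                       for j in range(n) if j != i)
        for w, j in dists[:k]:
            graph[i][j] = w
            graph[j][i] = w  # keep the graph undirected
    return graph


def dijkstra(graph, source):
    """Shortest-path distances from `source` to every vertex."""
    dist = {v: math.inf for v in graph}
    dist[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in graph[u].items():
            if d + w < dist[v]:
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist


def geodesic_matrix(points, k):
    """All-pairs shortest paths by running Dijkstra from each vertex."""
    graph = knn_graph(points, k)
    n = len(points)
    rows = [dijkstra(graph, s) for s in range(n)]
    return [[rows[s][t] for t in range(n)] for s in range(n)]


# Toy manifold: 50 points on a unit half circle (a 1D manifold in 2D).
pts = [(math.cos(math.pi * i / 49), math.sin(math.pi * i / 49)) for i in range(50)]
G = geodesic_matrix(pts, k=2)
end_to_end = G[0][49]  # close to the arc length pi, not the chord length 2
```

The estimated distance between the two end points follows the arc rather than the straight chord, which is the behavior the neighborhood graph is meant to capture.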

The next step of ISOMAP finds $x_i$ that best preserve the geodesic distances. Let $g_{ij}$ denote the
estimated geodesic distance between $y_i$ and $y_j$, and write $G = \{g_{ij}\}$ as the geodesic distance matrix.
The optimal $x_i$ can be found by applying the classical scaling [49], a simple multi-dimensional scaling
technique. Let $d_{ij} = \|x_i - x_j\|$. Without loss of generality, assume $\sum_i x_i = 0$. We have the
following:

$$d_{ij}^2 = (x_i - x_j)^T (x_i - x_j) = \|x_i\|^2 + \|x_j\|^2 - 2 x_i^T x_j,$$

$$\sum_i d_{ij}^2 = \sum_i \|x_i\|^2 + n \|x_j\|^2,$$

and

$$\sum_{ij} d_{ij}^2 = 2n \sum_i \|x_i\|^2.$$

So, $\sum_i \|x_i\|^2 = \frac{1}{2n} \sum_{ij} d_{ij}^2$, and

$$2 x_i^T x_j = \frac{1}{n} \sum_i d_{ij}^2 + \frac{1}{n} \sum_j d_{ij}^2 - \frac{1}{n^2} \sum_{ij} d_{ij}^2 - d_{ij}^2. \qquad (2.10)$$


If we replace $d_{ij}$ with the estimated geodesic distance $g_{ij}$ in Equation (2.10), $b_{ij}$, the target inner
product between $x_i$ and $x_j$, is given by

$$b_{ij} = -\frac{1}{2} \left( g_{ij}^2 - \frac{1}{n} \sum_i g_{ij}^2 - \frac{1}{n} \sum_j g_{ij}^2 + \frac{1}{n^2} \sum_{ij} g_{ij}^2 \right). \qquad (2.11)$$

Let $A = \{a_{ij}\}$ with $a_{ij} = -\frac{1}{2} g_{ij}^2$. Equation (2.11) means that $B = H_n A H_n$, where $B = \{b_{ij}\}$,
$H_n = I - \frac{1}{n} 1_{n,n}$, and $1_{n,n}$ denotes an n by n matrix with all entries one.

Figure 2.4: Example of neighborhood graph and geodesic distance approximation. (a) Input data.
(b) The neighborhood graph and an example of the shortest path. (c) This is the same as (b), except
the manifold is flattened. The true geodesic (blue line) is approximated by the shortest path (red
line).




Computing $H_n A H_n$ is effectively a centering operation on A, i.e., each column is subtracted by
its corresponding column mean, and each row is subtracted by its corresponding row mean. Because
multiplication by $H_n$ has this effect of “zeroing” the means for different rows and columns, $H_n$ is
often referred to as the centering matrix. The centering operation is also seen in other embedding
algorithms such as KPCA (section 2.5). Since B is the matrix of target inner products, we have
$B \approx X^T X$, where $X = [x_1, \ldots, x_n]$. We recover X by finding the best rank-d approximation for B,
which can be obtained via the eigen-decomposition of B. Let $\lambda_1, \ldots, \lambda_d$ be the d largest eigenvalues
of B with corresponding eigenvectors $v_1, \ldots, v_d$. We have $X = [\sqrt{\lambda_1} v_1, \ldots, \sqrt{\lambda_d} v_d]^T$. Here, we
assume $\lambda_i > 0$ for all $i = 1, \ldots, d$. Unlike Sammon’s mapping, the objective function for the optimal
X is less explicit: it is the sum of the square error (squared Frobenius norm) between the target
inner product ($b_{ij}$) and the actual inner product ($x_i^T x_j$).
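The classical scaling step can be sketched in a few lines. This is an illustrative example only: it uses plain power iteration with deflation in place of a proper eigen-solver (an assumption made to keep the sketch dependency-free), and the five 2D test points are made up. Because the input distances here are exactly Euclidean, the recovered embedding should reproduce them.

```python
import math
import random


def double_center(A):
    """Return H A H with H = I - (1/n) 1 1^T (subtract row and column means)."""
    n = len(A)
    row = [sum(r) / n for r in A]
    col = [sum(A[i][j] for i in range(n)) / n for j in range(n)]
    total = sum(row) / n
    return [[A[i][j] - row[i] - col[j] + total for j in range(n)] for i in range(n)]


def top_eigenpairs(B, d, iters=500, seed=0):
    """Leading eigenpairs of a symmetric PSD matrix by power iteration and deflation."""
    n = len(B)
    B = [r[:] for r in B]
    rng = random.Random(seed)
    pairs = []
    for _ in range(d):
        v = [rng.uniform(-1, 1) for _ in range(n)]
        for _ in range(iters):
            w = [sum(B[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        lam = sum(v[i] * sum(B[i][j] * v[j] for j in range(n)) for i in range(n))
        pairs.append((lam, v))
        for i in range(n):  # deflate: remove the component just found
            for j in range(n):
                B[i][j] -= lam * v[i] * v[j]
    return pairs


def classical_scaling(D, d):
    """Embed n points in R^d from an n by n distance matrix D.

    Returns one row per point, i.e. the transpose of X in the text.
    """
    A = [[-0.5 * x * x for x in r] for r in D]
    pairs = top_eigenpairs(double_center(A), d)
    return [[math.sqrt(lam) * v[i] for lam, v in pairs] for i in range(len(D))]


pts = [(0.0, 0.0), (3.0, 0.0), (0.0, 1.0), (3.0, 1.0), (1.5, 2.0)]
D = [[math.dist(p, q) for q in pts] for p in pts]
X = classical_scaling(D, d=2)
max_err = max(abs(math.dist(X[i], X[j]) - D[i][j])
              for i in range(5) for j in range(5))
```

In ISOMAP the matrix `D` would hold the estimated geodesic distances instead, in which case the embedding preserves them only approximately.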

One drawback of ISOMAP is the $O(n^2)$ memory requirement for storing the dense matrix of
geodesic distances. Also, solving the eigenvalue problem of a large dense matrix is relatively slow.
To reduce both the computational and memory requirements, landmark ISOMAP [55] sets apart a
subset of $\mathcal{Y}$ as landmark points and preserves only the geodesic distances from $y_i$ to these landmark
points. A similar idea has been applied to Sammon’s mapping before [25]. A continuum version of
ISOMAP has also been proposed [282]. ISOMAP can fail when there is a “hole” in the manifold
[66]. We also want to note that an exact isometric mapping of a manifold is theoretically possible
only when the manifold is “flat”, i.e., when the curvature tensor is zero, as pointed out in [16].

To  summarize,  ISOMAP  consists  of  the  following  steps: 

1. Construct a neighborhood graph using either the $\epsilon$-neighborhood or the knn neighborhood.

2. Solve the all-pairs shortest path problem on the neighborhood graph to obtain an estimate of
the geodesic distances $g_{ij}$.

3. Compute $A = \{a_{ij}\}$, where $a_{ij} = -\frac{1}{2} g_{ij}^2$, and $B = H_n A H_n$.

4. The d largest eigenvalues and the corresponding eigenvectors of B are found and
$X = [\sqrt{\lambda_1} v_1, \ldots, \sqrt{\lambda_d} v_d]^T$.

2.7  Locally  Linear  Embedding 

In  locally  linear  embedding  (LLE)  [219,  226],  each  local  region  on  a  manifold  is  approximated  by  a 
linear  hyperplane.  LLE  maps  the  high  dimensional  data  points  into  a  low  dimensional  space  so  that 
the  local  geometric  properties,  represented  by  the  reconstruction  weights,  are  best  preserved. 

Specifically, $y_i$ is reconstructed by its projection $\hat{y}_i$ on the hyperplane H passing through its
neighbors $N(y_i)$ (defined in section 2.2). Mathematically,

$$y_i \approx \hat{y}_i = \sum_j w_{ij} y_j,$$

with the constraint $\sum_j w_{ij} = 1$ to reflect the translational invariance for the reconstruction. By
minimizing the sum of square error of this approximation, we can also achieve invariance for rotation
and scaling. The weights $w_{ij}$ reflect the local geometric properties of $y_i$. This interpretation of
$w_{ij}$, however, is reasonable only when $y_i$ is well approximated by $\hat{y}_i$, i.e., when $y_i$ is close to H.
The weights are found by solving the following optimization problem:

$$\min_{\{w_{ij}\}} \Big\| y_i - \sum_j w_{ij} y_j \Big\|^2 \quad \text{subject to } \sum_j w_{ij} = 1, \; w_{ij} = 0 \text{ if } y_j \notin N(y_i), \text{ for all } i. \qquad (2.12)$$

Now, write $N(y_i) = \{y_{\tau_1}, \ldots, y_{\tau_L}\}$ and denote $z_j = y_{\tau_j}$. Note that $y_i \notin N(y_i)$. The optimization
problem (2.12) can be solved efficiently by first constructing an L by L matrix F such that
$f_{jk} = (z_j - y_i)^T (z_k - y_i)$. Equivalently, $F = (Z - y_i 1_{1,L})^T (Z - y_i 1_{1,L})$, where $F = \{f_{jk}\}$, $1_{1,L}$ is a 1
by L vector with all entries one, and $Z = [z_1, \ldots, z_L]$. The next step is to solve the equation

$$F u = 1_{L,1} \qquad (2.13)$$

and then we normalize$^7$ the solution u by $\tilde{u}_j = u_j / \sum_j u_j$. The values of $\tilde{u}_j$ are assigned to the
corresponding $w_{ij}$, i.e., $w_{i\tau_j} = \tilde{u}_j$, and the rest of $w_{ij}$ are set to zero. Sometimes, F can be
singular. This can happen when the neighborhood size L is larger than D, the dimension of $y_i$. In
this case, a small regularization term $\delta I_L$ is added to F before solving Equation (2.13). This
regularization has the effect of preferring values of $w_{ij}$ with small $\sum_j w_{ij}^2$. Finding $w_{ij}$ is efficient
because only small linear systems of equations are solved. Note that $w_{ij}$ can be negative and $y_i$ can
be outside the convex hull of $N(y_i)$.
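The weight computation for a single point can be sketched as follows. This is a minimal illustration, not the downloaded LLE code used in the experiments: the Gaussian elimination helper, the value of the regularization constant `delta`, and the toy point with three neighbors are all assumptions. With three neighbors in 2D the matrix F is singular (L > D), so the regularization term is needed.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def lle_weights(y, neighbors, delta=1e-8):
    """Reconstruction weights of y from its neighbors; the weights sum to one."""
    L = len(neighbors)
    diffs = [[zc - yc for zc, yc in zip(z, y)] for z in neighbors]
    # Local Gram matrix f_jk = (z_j - y)^T (z_k - y), plus delta * I.
    F = [[sum(a * b for a, b in zip(diffs[j], diffs[k])) + (delta if j == k else 0.0)
          for k in range(L)] for j in range(L)]
    u = solve_linear(F, [1.0] * L)   # Equation (2.13)
    total = sum(u)
    return [uj / total for uj in u]  # normalization step


y = (0.25, 0.25)
nbrs = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
w = lle_weights(y, nbrs)
recon = [sum(wj * z[c] for wj, z in zip(w, nbrs)) for c in range(2)]
```

Here the point lies inside the triangle spanned by its neighbors, so the weights come out close to its barycentric coordinates and the reconstruction is nearly exact.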

In the second phase of LLE, we seek $X = [x_1, \ldots, x_n]$ such that $x_i \approx \sum_j w_{ij} x_j$, and $x_i \in R^d$.
To make the problem well-defined, additional constraints $\sum_i x_i = 0$ and $\sum_i x_i x_i^T = I_d$ are needed.
The second constraint has the effect of both fixing the scale and enforcing different features in $x_i$
to carry independent information by requiring the sample covariances between different variables in
$x_i$ to be zero. The optimization problem is now

$$\min_{\{x_i\}} \sum_i \Big\| x_i - \sum_j w_{ij} x_j \Big\|^2 \quad \text{subject to } \sum_i x_i = 0 \text{ and } \sum_i x_i x_i^T = I_d. \qquad (2.14)$$


Note the similarity between Equations (2.12) and (2.14). Let $x^{(i)}$ denote the i-th row of X. Equation
(2.14) can be rewritten as

$$\min_X \operatorname{trace}\big(X (I - W)^T (I - W) X^T\big) \quad \text{subject to } x^{(i)} 1_{n,1} = 0 \text{ and } x^{(i)} x^{(j)T} = \delta_{ij}. \qquad (2.15)$$

This can be solved by eigen-decomposition on $M = (I - W)^T (I - W)$. Note that M is positive
semi-definite. Let $v_j$ be the eigenvector corresponding to the (j + 1)-th smallest eigenvalue. The
optimal X is given by $X = [v_1, \ldots, v_d]^T$. The first constraint is automatically satisfied because
$1_{n,1}$ is the eigenvector of M with eigenvalue 0. This eigenvalue problem is relatively easy because
M is sparse and can be represented as a product of sparser matrices $(I - W)^T$ and $(I - W)$.
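The two properties of M used above (the all-ones vector is an eigenvector with eigenvalue 0, and M is positive semi-definite) are easy to verify numerically. The sketch below uses a hypothetical 4 by 4 weight matrix whose rows sum to one; the numbers are made up purely for illustration.

```python
def lle_matrix(W):
    """M = (I - W)^T (I - W) for an n by n weight matrix W (list of rows)."""
    n = len(W)
    R = [[(1.0 if i == j else 0.0) - W[i][j] for j in range(n)] for i in range(n)]
    return [[sum(R[k][i] * R[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


# Hypothetical weights for 4 points; each row sums to one, zero diagonal.
# Negative weights are allowed, as noted in the text.
W = [[0.0, 0.7, 0.3, 0.0],
     [0.5, 0.0, 0.5, 0.0],
     [0.0, 0.4, 0.0, 0.6],
     [0.0, -0.2, 1.2, 0.0]]
M = lle_matrix(W)
M_times_one = [sum(row) for row in M]  # M 1, which should vanish
u = [1.0, -2.0, 0.5, 3.0]
quad = sum(u[i] * M[i][j] * u[j] for i in range(4) for j in range(4))  # u^T M u
```

Since $(I - W) 1 = 0$ whenever the rows of W sum to one, the product `M_times_one` is zero up to rounding, which is why the eigenvector with the smallest eigenvalue is discarded.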

The  above  exposition  of  LLE  assumes  the  pattern  matrix  as  input.  LLE  can  be  modified  to  work 
with  a  dissimilarity  matrix  [226] .  There  is  also  a  supervised  extension  of  LLE  [53,  54] ,  which  uses 
the  class  labels  to  modify  the  neighborhood  structure.  The  kernel  trick  can  also  be  applied  to  LLE 
to  visualize  the  data  points  in  the  feature  space  [56].  The  case  when  LLE  is  applied  to  data  sets 
with  natural  clustering  structure  has  been  examined  in  [206] . 

In summary, LLE includes the following steps:

1. Find the neighbors of each $y_i$ according to either the $\epsilon$-neighborhood or the knn neighborhood.

2. For each $y_i$, form the matrix F and solve the equation $F u = 1_{L,1}$. After normalizing u by
$\tilde{u}_j = u_j / \sum_j u_j$, set $w_{i\tau_j} = \tilde{u}_j$ and the remaining $w_{ij}$ to zero.

3. Find the second to the (d + 1)-th smallest eigenvalues of $(I - W)^T (I - W)$ by a sparse eigenvalue
solver and let $\{v_1, \ldots, v_d\}$ be the eigenvectors.

4. Obtain the reduced dimension representation by $X = [v_1, \ldots, v_d]^T$.


7 The normalization is valid because $\sum_j u_j = 1_{L,1}^T F^{-1} 1_{L,1}$ and hence $\sum_j u_j$ cannot be zero, by the positive
definiteness of $F^{-1}$.


2.8  Laplacian  Eigenmap 


The  approach  taken  by  Laplacian  eigenmap  [16]  for  nonlinear  mapping  is  different  from  those  of 
ISOMAP  and  LLE.  Laplacian  eigenmap  constructs  orthogonal  smooth  functions  defined  on  the 
manifold  based  on  the  Laplacian  of  the  neighborhood  graph.  It  has  its  roots  in  spectral  graph 
theory  [42]. 

As in ISOMAP, a neighborhood graph $G = (V, E)$ is first constructed. Unlike ISOMAP, where
the weight $w_{ij}$ of the edge $(v_i, v_j)$ represents the distance between $v_i$ and $v_j$, the weight in Laplacian
eigenmap represents the similarity between $v_i$ and $v_j$. The weight $w_{ij}$ can be set by

$$w_{ij} = \exp\left( -\frac{\|y_i - y_j\|^2}{4t} \right) \qquad (2.16)$$


with  t  as  an  algorithmic  parameter,  or  it  can  be  simply  set  to  one.  The  use  of  the  exponential 
function  to  transform  a  distance  value  to  a  similarity  value  can  be  justified  by  its  relationship  to  the 
heat  kernel  [16]. 

The nonlinear mapping problem is recast as the graph embedding problem that maps the vertices
in the neighborhood graph G to $R^d$. The first step is to find a “good” function $f(\cdot) : V \mapsto R$ that
maps the vertices in G to a real number. Since the domain of $f(\cdot)$ is finite, $f(\cdot)$ can be represented
by a vector u, with $f(v_i) = u_i$. According to spectral graph theory, the smoothness of f can be
defined by

$$S = \frac{1}{2} \sum_{ij} w_{ij} (u_i - u_j)^2. \qquad (2.17)$$

The intuition of S is that, for large $w_{ij}$, the vertices $v_i$ and $v_j$ are “similar” and hence the difference
between $f(v_i)$ and $f(v_j)$ should be small if $f(\cdot)$ is smooth. A smooth mapping $f(\cdot)$ is desirable
because a faithful embedding of the graph should assign similar values to $v_i$ and $v_j$ when they are
close. We can rewrite S as

$$S = \frac{1}{2} \sum_{ij} \big( w_{ij} u_i^2 + w_{ij} u_j^2 - 2 w_{ij} u_i u_j \big)$$

$$\;\; = \frac{1}{2} \Big( \sum_i u_i^2 \sum_j w_{ij} + \sum_j u_j^2 \sum_i w_{ij} - 2 \sum_{ij} w_{ij} u_i u_j \Big) \qquad (2.18)$$

$$\;\; = \sum_i u_i^2 \sum_j w_{ij} - \sum_{ij} w_{ij} u_i u_j = u^T L u,$$

where L is the graph Laplacian defined by $L = D - W$, $W = \{w_{ij}\}$ is the graph weight matrix, and
D is a diagonal matrix with $d_{ii} = \sum_j w_{ij}$. The matrix L can be thought of as the Laplacian operator
on functions defined on the graph. Since $d_{ii}$ can be interpreted as the importance of $v_i$, the natural
inner product between two functions $f_1(\cdot)$ and $f_2(\cdot)$ defined on the graph is $u_1^T D u_2$.
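The identity $S = u^T L u$ can be confirmed on a small example. The sketch below is illustrative only: the four points, the edge list standing in for a neighborhood graph, the value t = 1, and the test vector u are all hypothetical.

```python
import math


def heat_weight(yi, yj, t):
    """Edge weight exp(-||yi - yj||^2 / (4 t)), as in Equation (2.16)."""
    return math.exp(-math.dist(yi, yj) ** 2 / (4.0 * t))


def graph_laplacian(W):
    """L = D - W, with D the diagonal matrix of row sums of W."""
    n = len(W)
    return [[(sum(W[i]) if i == j else 0.0) - W[i][j] for j in range(n)]
            for i in range(n)]


pts = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (3.0, 1.0)]
edges = [(0, 1), (1, 2), (2, 3), (0, 2)]  # hypothetical neighborhood graph
n = len(pts)
W = [[0.0] * n for _ in range(n)]
for i, j in edges:
    W[i][j] = W[j][i] = heat_weight(pts[i], pts[j], t=1.0)

L = graph_laplacian(W)
u = [0.3, -1.0, 2.0, 0.5]
# Smoothness from Equation (2.17) and the quadratic form u^T L u.
S = 0.5 * sum(W[i][j] * (u[i] - u[j]) ** 2 for i in range(n) for j in range(n))
quad = sum(u[i] * L[i][j] * u[j] for i in range(n) for j in range(n))
row_sums = [sum(row) for row in L]  # L 1 = 0: the constant function has S = 0
```

The two quantities agree up to rounding, and the zero row sums show that the constant vector is the smoothest possible function on the graph.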

Because the constant function is the smoothest and is uninteresting, we seek $f(\cdot)$ to be as smooth
as possible while being orthogonal to the constant function. The norm of $f(\cdot)$ is constrained to be
one to make the problem well-defined. Thus we want to solve

$$\min_u u^T L u \quad \text{subject to } u^T D u = 1 \text{ and } u^T D 1_{n,1} = 0. \qquad (2.19)$$

This can be done by solving the generalized eigenvalue problem

$$L u = \lambda D u, \qquad (2.20)$$

after noting that $1_{n,1}$ is a solution to Equation (2.20) with $\lambda = 0$. Here, $1_{n,1}$ denotes an n by
1 vector with all entries one. As L is positive semi-definite, the eigenvector corresponding to the
second smallest eigenvalue of Equation (2.20) yields the desired $f(\cdot)$. In general, d orthogonal$^8$
functions $\{f_1(\cdot), \ldots, f_d(\cdot)\}$ that are as smooth as possible are sought to map the vertices to $R^d$.
The functions can be obtained from the eigenvectors corresponding to the second to the (d + 1)-th
smallest eigenvalues in Equation (2.20). The low dimensional representation of $y_i$ is then given by
$x_i = (f_1(v_i), f_2(v_i), \ldots, f_d(v_i))^T$. In matrix form, $X = [u_1, \ldots, u_d]^T$.

The embedding problem of the neighborhood graph and the embedding problem of the points
in the manifold are related in the following way. A smooth function $f(\cdot)$ that maps the point $y_i$ in
the manifold to $x_i \in R^d$ is preferable, because a faithful mapping should give similar values (small
$\|x_i - x_j\|$) to $y_i$ and $y_j$ when $\|y_i - y_j\|$ is small. A small $\|y_i - y_j\|$ corresponds to a large $w_{ij}$ in the
graph.  Thus,  intuitively,  a  smooth  function  defined  on  the  graph  corresponds  to  a  smooth  function 
defined  on  the  manifold.  In  fact,  this  relationship  can  be  made  more  rigorous,  because  the  graph 
Laplacian  is  closely  related  to  the  Laplace-Beltrami  operator  on  the  manifold,  which  in  turn  is  related 
to  the  smoothness  of  a  function  defined  on  the  manifold.  The  eigenvectors  of  the  graph  Laplacian 
correspond  to  the  eigenfunctions  of  the  Laplace-Beltrami  operator,  and  the  eigenfunctions  with  small 
eigenvalues  provide  a  “smooth”  basis  of  the  functions  defined  on  the  manifold.  The  neighborhood 
graph  used  in  Laplacian  eigenmap  can  thus  be  viewed  as  a  discretization  tool  for  computation  on 
the  manifold. 

There  is  also  a  close  relationship  between  Laplacian  eigenmap  and  spectral  clustering.  In  fact,  the 
spectral  clustering  algorithm  in  [194]  is  almost  the  same  as  first  performing  Laplacian  eigenmap  and 
then applying k-means clustering on the low dimensional feature vectors. The manifold structure
discovered  by  Laplacian  eigenmap  can  also  be  used  to  train  a  classifier  in  a  semi-supervised  setting 
[182].  The  Laplacian  of  a  graph  can  also  lead  to  an  interesting  kernel  function  (as  in  SVM)  for  vertices 
in  a  graph  [154].  This  idea  of  nonlinear  mapping  via  graph  embedding  has  also  been  extended  to 
learn  a  linear  mapping  [109]  as  well  as  generalized  to  the  case  when  a  vector  is  associated  with  each 
vertex  in  the  graph  [31]. 

To sum up, the steps for Laplacian eigenmap include:

1. Construct a neighborhood graph of $\mathcal{Y}$ by either the $\epsilon$-neighborhood or the knn neighborhood.

2. Compute the edge weight $w_{ij}$ by either $\exp(-\|y_i - y_j\|^2 / (4t))$, or simply set $w_{ij}$ to 1.

3. Compute D and the graph Laplacian L.

4. Find the second to the (d + 1)-th smallest eigenvalues in the generalized eigenvalue problem
$L u = \lambda D u$ and denote the eigenvectors by $u_1, \ldots, u_d$. The low dimensional feature vectors
are given by $X = [u_1, \ldots, u_d]^T$.


8 Orthogonality is preferred as it suggests the independence of information. Also, in PCA, each of the extracted
features is orthogonal to the others.

2.9  Global  Co-ordinates  via  Local  Co-ordinates 

Recall that in section 2.2, an atlas of a manifold $\mathcal{M}$ is defined as a collection of co-ordinate charts
that covers the entire $\mathcal{M}$, and overlapping charts can be “connected” smoothly. This idea has
inspired  several  nonlinear  mapping  algorithms  [220,  29,  247]  which  construct  different  local  charts 
and  join  them  together. 

There are two stages in this type of algorithm. First, different local models are fitted to the
data, usually by means of a mixture model. Each local model gives rise to a local co-ordinate
system. A local model can be, for example, a Gaussian or a factor analyzer. Let $z_{is}$ be the local
co-ordinate given to $y_i$ by the s-th local co-ordinate system. Let $r_{is}$ denote the suitability of using
the s-th local model for $y_i$. We require $r_{is} \geq 0$ and $\sum_s r_{is} = 1$. The introduction of $r_{is}$ can
represent the fact that only a small number of local models are meaningful for each $y_i$. Typically,
$r_{is}$ is obtained as the posterior probability of the s-th local model, given $y_i$.
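As a small illustration of how such posterior weights behave, the sketch below computes $r_{is}$ for one point under a mixture of isotropic Gaussians. The isotropic form, the function name, and all numbers are assumptions made for this example; the algorithms discussed in this section use richer local models.

```python
import math


def responsibilities(y, means, variances, priors):
    """Posterior P(s | y) for isotropic Gaussian local models (an assumption)."""
    D = len(y)
    logs = []
    for mu, var, prior in zip(means, variances, priors):
        sq = sum((a - b) ** 2 for a, b in zip(y, mu))
        logs.append(math.log(prior) - 0.5 * D * math.log(2 * math.pi * var)
                    - 0.5 * sq / var)
    m = max(logs)  # subtract the maximum for numerical stability
    ex = [math.exp(v - m) for v in logs]
    z = sum(ex)
    return [e / z for e in ex]


# Hypothetical mixture with three components; the point sits near the first.
r = responsibilities((0.1, 0.0),
                     means=[(0.0, 0.0), (5.0, 0.0), (0.0, 6.0)],
                     variances=[1.0, 1.0, 1.0],
                     priors=[1 / 3, 1 / 3, 1 / 3])
```

The weights are non-negative and sum to one, and nearly all of the mass falls on the closest local model, which matches the remark that only a few local models are meaningful for each point.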

In the second stage, different local co-ordinates of $y_i$ are combined to give a global co-ordinate.
Let $g_{is}$ be the global co-ordinate of $y_i$ due to the s-th local model, and let $g_i \in R^d$ be the
corresponding “combined” global co-ordinate. In the three papers we have considered here, $g_{is}$ is
simply the affine transform of the local co-ordinate, $g_{is} = L_s \tilde{z}_{is}$. Here, $\tilde{z}_{is}$ is the “augmented” $z_{is}$,
$\tilde{z}_{is} = [z_{is}^T, 1]^T$. $L_s$ is the (unknown) transformation matrix with d rows for the s-th local model.
Note that it is desirable for neighboring local models to be “similar” so that the global co-ordinates
are more consistent. An important characteristic of the algorithms in this section is that, unlike
ISOMAP, LLE, or Laplacian eigenmap, extension for a point y that is outside the training data $\mathcal{Y}$
is easy after computing $z_s$ and $r_s$ for different s.


2.9.1  Global  Co-ordination 

In the global co-ordination algorithm in [220], the first and the second stages are performed
simultaneously by the variational method. The first stage is done by fitting a mixture of factor analyzers.
Under the s-th local model, a data point $y_i$ is modeled by

$$y_i = \mu_s + \Lambda_s z_{is} + \epsilon_{is}, \qquad (2.21)$$

where $\mu_s$ is the mean, $\Lambda_s$ is the factor loading matrix, and $\epsilon_{is}$ is the noise that follows $N(0, \Psi_s)$,
a multivariate Gaussian with mean 0 and covariance $\Psi_s$. By the definition of factor analyzer, $\Psi_s$
is diagonal. The hidden variable $z_{is}$ is assumed to follow $N(0, I)$. The scale of $z_{is}$ is unimportant
because it can be absorbed by the factor loading matrix. Let $\alpha_s$ be the prior probability of the s-th
factor analyzer. The parameters are $\{\alpha_s, \mu_s, \Lambda_s, \Psi_s\}$, and the data density is given by

$$p(y_i) = \sum_s \int_{z_{is}} p(y_i | s, z_{is}) \, p(z_{is} | s) \, p(s) \, dz_{is}$$

$$\;\; = \sum_s \alpha_s (2\pi)^{-D/2} \big( \det(\Lambda_s \Lambda_s^T + \Psi_s) \big)^{-1/2} \exp\Big( -\frac{1}{2} (y_i - \mu_s)^T (\Lambda_s \Lambda_s^T + \Psi_s)^{-1} (y_i - \mu_s) \Big). \qquad (2.22)$$

We define $r_{is}$ as the posterior probability of the s-th local model given $y_i$, $P(s | y_i)$, and it can
be computed based on Equation (2.22). Equation (2.22) also gives rise to $p(z_{is} | s, y_i)$ and hence
$p(g_{is} | s, y_i)$, because $g_{is}$ is a function of $z_{is}$ and $L_s$. The posterior probability of the global
co-ordinate is defined as

$$p(g_i | y_i) = \sum_s P(s | y_i) \, p(g_{is} | s, y_i). \qquad (2.23)$$

Equation (2.23) assumes that the overall global co-ordinate $g_i$ is selected among different $g_{is}$, with
s stochastically selected according to the posterior probability of the s-th model. In the case where
$y_i$ is likely to be generated either by the j-th or the k-th local model, the corresponding global
co-ordinates $g_{ij}$ and $g_{ik}$ should be similar. This implies that the posterior density $p(g_i | y_i)$ should
be unimodal. Enforcing the unimodality of $p(g_i | y_i)$ directly is difficult. So, the authors in [220]
instead drive $p(g_i | y_i)$ to be as similar to a Gaussian distribution as possible by adding an extra
term to the log-likelihood objective function to be maximized:

$$\Phi = \sum_i \log p(y_i) - \sum_{is} D_{KL}\big( q(g_i, s | y_i) \,\|\, p(g_i, s | y_i) \big). \qquad (2.24)$$

Here, $D_{KL}$ is the Kullback-Leibler divergence defined as

$$D_{KL}(Q \| P) = \int Q(y) \log \frac{Q(y)}{P(y)} \, dy, \qquad (2.25)$$

and $q(g_i, s | y_i)$ is assumed to be factorized as

$$q(g_i, s | y_i) = q_i(g_i | y_i) \, q_i(s | y_i)$$

with $q_i(g_i | y_i)$ as a Gaussian and $q_i(s | y_i)$ as a multinomial distribution. This addition of a divergence
term between a posterior distribution and a factorized distribution is commonly seen in the literature
on the variational method. The objective function in Equation (2.24) can be maximized by an EM-type
algorithm, which estimates the parameters $\{\alpha_s, \mu_s, \Lambda_s, \Psi_s, L_s\}$ as well as the parameters for
$q_i(g_i | y_i)$ and $q_i(s | y_i)$. Since the first and the second stages are carried out simultaneously, local
models that lead to consistent global co-ordinates are implicitly favored.


2.9.2  Charting 

For the charting algorithm in [29], the first and the second stages are performed separately. This
decoupling decreases the complexity of the optimization problem and can reduce the chance of
getting trapped in poor local minima. In the first stage, a mixture of Gaussians is fitted to the data,

$$p(y) = \sum_s \alpha_s N(\mu_s, \Sigma_s), \qquad (2.26)$$

with the constraint that two adjacent Gaussians should be “similar”. This is achieved by using a
prior distribution on the mean vectors and the covariance matrices that encourages the similarity of
adjacent Gaussians:

$$p(\{\mu_s\}, \{\Sigma_s\}) \propto \exp\Big( -\sum_s \sum_{j, j \neq s} \lambda_s(\mu_j) \, D_{KL}\big( N(\mu_s, \Sigma_s) \,\|\, N(\mu_j, \Sigma_j) \big) \Big), \qquad (2.27)$$

where $\lambda_s(\mu_j)$ measures the closeness between the locations of the s-th and the j-th Gaussian
components. It is set to $\lambda_s(\mu_j) \propto \exp(-\|\mu_s - \mu_j\|^2 / (2\sigma^2))$, where $\sigma$ is a width parameter determined
according to the neighborhood structure. The prior distribution also makes the parameter estimation
problem better conditioned. In practice, n Gaussian components are used, with the center of
the i-th component $\mu_i$ set to $y_i$ and the weight of each component set to 1/n. The only parameters
to be estimated are the covariance matrices. The MAP estimate of the covariance matrices can be
shown to satisfy a set of constrained linear equations and they are obtained by solving this set of
equations.

In the second stage, the local co-ordinate $z_{is}$ is first obtained as $z_{is} = V_s^T (y_i - \mu_s)$, where $V_s$
consists of the d leading eigenvectors of $\Sigma_s$. We can regard $z_{is}$ as the feature extracted from $y_i$
using PCA on the s-th local model. The local model weight $r_{is}$ is, once again, set to the posterior
probability of the s-th local model given $y_i$. The transformation matrices $L_s$ are found by solving
the following weighted least square problem:

$$\min_{\{L_s\}} \sum_{i,j,k} r_{ij} r_{ik} \big\| L_j \tilde{z}_{ij} - L_k \tilde{z}_{ik} \big\|_F^2. \qquad (2.28)$$

Here, $\|X\|_F^2$ denotes the square of the Frobenius norm, $\|X\|_F^2 = \operatorname{trace}(X^T X)$. Intuitively, we want
to find the transformation matrices such that the global co-ordinates due to different local models are
the most consistent in the least square sense, weighted by the importance of different local models.

Equation (2.28) can be solved as follows. Let K and h be the number of local models and the
length of the augmented local co-ordinate $\tilde{z}_{is}$, respectively. Define $Z_s = [\tilde{z}_{1s}, \ldots, \tilde{z}_{ns}]$ as the h by
n matrix of local co-ordinates using the s-th local model for all the data points. Define the Kh
by n matrix $T_s$ by $T_s = [0_{n,(s-1)h}, Z_s^T, 0_{n,(K-s)h}]^T$, where $0_{n,m}$ denotes a zero matrix with
size n by m. Let $P_s$ be an n by n diagonal matrix where the (i,i)-th entry is $r_{is}$. The solution
to Equation (2.28) is given by the d trailing eigenvectors of the Kh by Kh matrix $Q Q^T$, where
$Q = \sum_j \sum_k (T_j - T_k) P_j P_k$. Note that the second stage is independent of the first stage. In
particular, an alternative collection of local models can be used, as long as $z_{is}$ and $r_{is}$ can be calculated.


2.9.3  LLC 

The LLC algorithm described in [247] concerns the second stage only. Given the local co-ordinates
$z_{is}$ and the model confidences $r_{is}$ computed from the first stage, the LLC algorithm finds the
$L_s$ such that the local geometric properties are best preserved in the sense of the LLE loss function.
The global co-ordinate $g_i$ is assumed to be a weighted sum in the form

$$g_i = \sum_s r_{is} g_{is} = \sum_s r_{is} L_s \tilde{z}_{is}. \qquad (2.29)$$

Suppose there are K local models, each of which gives a local co-ordinate $z_{is}$ in an (h - 1)
dimensional space$^9$. We stack $r_{is} \tilde{z}_{is}$ for different s to get a vector of length Kh,
$u_i = [r_{i1} \tilde{z}_{i1}^T, \ldots, r_{iK} \tilde{z}_{iK}^T]^T$, and concatenate different $L_s$ to form a d by Kh matrix
$J = [L_1, L_2, \ldots, L_K]$. (Each $L_s$ is of size d by h.) Equation (2.29) can be rewritten as $g_i = J u_i$.
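The equivalence between the weighted sum in Equation (2.29) and the stacked form $g_i = J u_i$ can be checked directly. The sketch below uses made-up numbers for K = 2 local models with h = 3 and d = 2; the function names are hypothetical and the matrices are not from any fitted model.

```python
def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def global_coordinate(r_i, z_i, Ls):
    """g_i = sum_s r_is L_s [z_is; 1], the weighted sum in Equation (2.29)."""
    d = len(Ls[0])
    g = [0.0] * d
    for r, z, L in zip(r_i, z_i, Ls):
        gs = matvec(L, list(z) + [1.0])  # affine map of the local co-ordinate
        g = [a + r * b for a, b in zip(g, gs)]
    return g


def stacked_form(r_i, z_i, Ls):
    """The same quantity computed as J u_i with J = [L_1, ..., L_K]."""
    u = [r * c for r, z in zip(r_i, z_i) for c in list(z) + [1.0]]
    J = [sum((L[row] for L in Ls), []) for row in range(len(Ls[0]))]
    return matvec(J, u)


# Hypothetical numbers: K = 2 local models, h - 1 = 2, d = 2.
Ls = [[[1.0, 0.0, 0.5], [0.0, 1.0, -0.5]],
      [[0.0, 2.0, 1.0], [-2.0, 0.0, 0.0]]]
r_i = [0.8, 0.2]
z_i = [(0.3, -0.1), (0.4, 0.2)]
g_direct = global_coordinate(r_i, z_i, Ls)
g_stacked = stacked_form(r_i, z_i, Ls)
```

Both routes give the same vector, which is what allows the unknown $L_s$ to be collected into a single matrix J.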

The global co-ordinate matrix, $G = [g_1, \ldots, g_n]$, is thus given by $G = JU$, where U is a Kh by
n matrix $U = [u_1, \ldots, u_n]$. Denote the i-th row of J by $j^{(i)}$. If we substitute G for X in the LLE
objective function in Equation (2.15), we have

$$\min_J \operatorname{trace}\big( J U (I - W)^T (I - W) U^T J^T \big) \quad \text{subject to } j^{(i)} U 1_{n,1} = 0 \text{ and } j^{(i)} U U^T j^{(j)T} = \delta_{ij}, \qquad (2.30)$$

where W is defined in the same way (the neighborhood reconstruction weight) as in section 2.7.
Here, $1_{n,1}$ denotes an n by 1 vector with all entries one. Note that obtaining W is efficient (see
section 2.7 for details). The value of $j^{(i)}$ can be obtained as the solution of the generalized eigenvalue
problem $(U (I - W)^T (I - W) U^T) v = \lambda (U U^T) v$. The authors in [247] claim that the $j^{(i)}$ thus
obtained satisfies the constraint $j^{(i)} U 1_{n,1} = 0$ automatically because $U 1_{n,1}$ is an eigenvector of
the generalized eigenvalue problem with eigenvalue 0. However, this is not true in general. In any
case, the authors in [247] use the eigenvectors corresponding to the second to the (d + 1)-th smallest
eigenvalues as the solution of J. Note that this generalized eigenvalue problem is about a Kh by
Kh matrix, instead of a large n by n matrix in the original LLE. After finding $j^{(i)}$, J and hence $L_s$
are reconstructed. The global co-ordinate is obtained via Equation (2.29).


9 In general, different local models can give local co-ordinates with different lengths, as emphasized in [247]. Here
we assume a common h for the ease of notation.

The  idea  of  this  algorithm  is  somewhat  analogous  to  the  locality  preserving  projection  (LPP) 
algorithm  [109].  LPP  simplifies  the  eigenvalue  problem  by  the  extra  information  that  the  projection 
should  be  linear,  whereas  the  current  algorithm  simplifies  the  eigenvalue  problem  by  the  given 
mixture  model. 


2.10  Experiments 

We  applied  some  of  these  algorithms  on  three  synthetic  3D  data  sets.  The  data  manifold  and  the 
data  points  can  be  seen  in  Figure  2.5.  The  first  data  set,  parabolic,  consists  of  2000  randomly 
sampled  data  points  lying  on  a  paraboloid.  It  is  an  example  of  a  nonlinear  manifold  with  a  simple 
analytic  form  -  a  second  degree  polynomial  in  the  co-ordinates  in  this  case.  The  second  data  set  swiss 
roll  and  the  third  data  set  S-curve  are  commonly  used  for  validating  manifold  learning  algorithms. 
Again,  2000  points  are  randomly  sampled  from  the  “Swiss  roll”  and  the  S-shaped  surface  to  create 
the  data  sets,  respectively.  KPCA,  ISOMAP,  LLE,  and  Laplacian  eigenmap  were  run  on  these  3D 
data  sets  to  project  the  data  to  2D.  We  have  implemented  KPCA  and  Laplacian  eigenmap  ourselves, 
while the implementations for ISOMAP$^{10}$ and LLE$^{11}$ were downloaded from their respective web
sites. For ISOMAP, LLE, and Laplacian eigenmap, knn neighborhood with k = 12 is used. The
edge weight is set to one for Laplacian eigenmap. For KPCA, polynomial kernel with degree 2 is
used. For comparison, the standard PCA and Sammon’s mapping were also performed on these
data  sets.  Sammon’s  mapping  is  initialized  by  the  result  of  PCA. 

The results of these algorithms can be seen in Figures 2.6, 2.7, and 2.8. The data points are
colored differently to visualize their locations on the manifold. We intentionally omit the
“goodness-of-fit” or “error” on the projection results, because the criteria used by different algorithms
(Sammon’s stress in Sammon’s mapping, correlation of distances in ISOMAP, reconstruction error in LLE,
residual variance in PCA and KPCA, to name a few) are very different and it can be misleading to
compare them.

For the parabolic data set, we can see in Figures 2.6(b) and 2.6(c) that both ISOMAP and LLE
recover the intrinsic co-ordinates very well, because the changes in the color of the data points after
embedding are smooth. Since this manifold is quadratic, we expect that KPCA with a quadratic
kernel function should also recover the true structure of the data. It turns out that the first two
kernel principal components cannot lead to a clean mapping of the data points. Instead, the second
and the third kernel principal components extract the structure of the data (Figure 2.6(a)). The
first two features extracted by Laplacian eigenmap cannot recover the desired trend in the data. The
target structure with slight distortion can be recovered if the second and the third extracted features
are used instead (Figure 2.6(d)). PCA and Sammon’s mapping cannot recover the structure of this
data set (Figures 2.6(e) and 2.6(f)). The similarity of the results of PCA and Sammon’s mapping
can be attributed to the fact that Sammon’s mapping is initialized by the PCA solution. The initial
PCA solution is already a good solution with respect to Sammon’s stress for this low-dimensional
data set.


10 ISOMAP web site: http://stanford.isomap.edu
11 LLE web site: http://www.cs.toronto.edu/~roweis/lle/

For the swiss roll data set, we can see from Figures 2.7(b) and 2.7(c) that ISOMAP and LLE
did a good job “unfolding” the manifold. For Laplacian eigenmap, once again, the first two
extracted features cannot be interpreted easily, though the structure of the data set is revealed if the
second and the third features are used (Figure 2.7(d)). KPCA cannot recover the intrinsic structure
of the data set no matter which kernel principal components are used. An example of the poor result of
KPCA is shown in Figure 2.7(a). PCA and Sammon’s mapping also cannot recover the underlying
structure (Figures 2.7(e) and 2.7(f)). The results for the third data set, S-curve (Figure 2.8), are
similar to those of swiss roll, with the exception that Laplacian eigenmap can recover the desired
structure using the first two extracted features.

In  addition  to  these  synthetic  data  sets,  we  have  also  tested  these  nonlinear  mapping  algorithms 
on a high-dimensional real world data set: the face images used in [175]. The task here is to classify a
64 by 64 face image in this data set as either the “Asian class” or the “non-Asian class”. This data
set will be described in more detail in Section 3.3. The results of mapping these 4096D data points
to  3D  can  be  seen  in  Figure  2.9.  Data  points  from  the  two  classes  are  shown  in  different  colors.  The 
(training)  error  rates  using  quadratic  discriminant  analysis  are  also  computed  for  different  mappings. 
As  we  can  see  from  Figures  2.9(a),  2.9(d),  2.9(e)  and  2.9(f),  the  mapping  results  by  Laplacian 
eigenmap,  KPCA,  PCA  and  Sammon’s  mapping  are  not  very  useful.  The  two  classes  are  not  well- 
separated,  and  the  error  rates  are  also  high.  ISOMAP  maps  the  two  classes  more  separately  and  has 
smaller  error  rates  (Figure  2.9(b)).  For  LLE  (Figure  2.9(c)),  although  the  mapping  results  look  a 
bit  unnatural,  the  error  rate  turns  out  to  be  the  smallest,  indicating  the  two  classes  are  reasonably 
separated.  It  should  be  noted  that  the  intrinsic  dimensionality  of  this  data  set  is  probably  higher 
than  3.  So,  mapping  the  data  to  3D,  while  good  for  visualization,  can  lose  some  information  and  is 
suboptimal  for  classification. 

From  these  experiments,  we  can  see  that  both  ISOMAP  and  LLE  recover  the  intrinsic  structure 
of  the  data  sets  well.  The  performance  of  Laplacian  eigenmap  is  less  satisfactory.  We  have  attempted 
to  set  the  edge  weight  by  the  exponential  function  of  distances  (Equation  (2.16))  instead  of  one,  but 
the  preliminary  results  suggest  that  a  good  choice  of  the  width  parameter  t  is  hard  to  obtain.  The 
standard PCA and Sammon’s mapping cannot recover the target structure of the data. This is not
surprising for PCA, because it is a linear algorithm and the underlying structure of the data cannot be
reflected by any linear function of the features. Sammon’s mapping does not give very good
results because it is “global”, meaning that the relationship between all pairs of
data points in the 3D space is considered, so local properties of the manifold cannot be modeled.
KPCA fails because the parametric representation of the manifold for swiss roll,
S-curve, and the face images is hard to obtain, and is certainly not quadratic. The assumption
in KPCA is therefore violated, which leads to poor results.

2.11 Summary


In this chapter, we have described different approaches for nonlinear mapping based on fairly
different principles. The algorithms ISOMAP, LLE, and Laplacian eigenmap are non-iterative and require
mainly eigen-decomposition, which is well understood with many off-the-shelf algorithms available.
ISOMAP, LLE, and Laplacian eigenmap are basically non-parametric algorithms. While this
provides extra flexibility to model the manifold, more data points are needed to give a good estimate
of the low dimensional vector. The basic version of some of the algorithms (Sammon’s mapping,
ISOMAP, LLE, and Laplacian eigenmap) cannot generalize the mapping to patterns outside the
training set, though an out-of-sample extension has been proposed [17].

There are interesting connections between some of these algorithms. ISOMAP, LLE, and
Laplacian eigenmap can be shown to be special cases of KPCA [105]. The matrix M in LLE can be
shown to be related to the square of the Laplace-Beltrami operator [16], an important concept
in Laplacian eigenmap. While these techniques have been successfully applied to high dimensional
data sets like face images, digit images, texture images, motion data, and textual data, the relative
merits of these algorithms in practice are still not clear. More comparative studies like the one in
[196] would be helpful.

[Figure 2.5 panels: (a) parabolic, the manifold; (b) parabolic, the data; (c) swiss roll, the manifold;
(d) swiss roll, the data; (e) S-curve, the manifold]

Figure 2.5: Data sets used in the experiments for nonlinear mapping. The manifold and the data
points are shown. The data points are colored according to the major structure of the data as
perceived by humans.

[Figure 2.6 panels; surviving panel labels: PCA, Sammon]

Figure 2.6: Results of nonlinear mapping algorithms on the parabolic data set. “2nd and 3rd” in
the captions means that we are showing the second and the third components, instead of the first
two.

Figure 2.7: Results of nonlinear mapping algorithms on the swiss roll data set. “2nd and 3rd” in
the captions means that we are showing the second and the third components, instead of the first
two.

[Figure 2.8 panels; surviving panel labels: (b) ISOMAP; (c) LLE; (d) Laplacian Eigenmap;
(e) Standard PCA; (f) Sammon’s mapping]

Figure 2.8: Results of nonlinear mapping algorithms on the S-curve data set.

[Figure 2.9 panels; surviving panel labels: (e) Standard PCA, 36.4% error; (f) Sammon’s mapping,
39.3% error]

Figure 2.9: Results of nonlinear mapping algorithms on the face images. The two classes (Asians
and non-Asians) are shown in two different colors. The (training) error rates by applying quadratic
discriminant analysis on the low dimensional data points are shown in the captions.

Chapter 3


Incremental Nonlinear Dimensionality Reduction By Manifold Learning


In Chapter 2, we discussed different algorithms to achieve dimensionality reduction by nonlinear
mapping. Most of these nonlinear mapping algorithms operate in a batch mode^1, meaning that all
the data points need to be available during training. In applications like surveillance, where (image)
data are collected sequentially, a batch method is computationally demanding: repeatedly running
the “batch” version whenever new data points become available takes a long time, and it is wasteful
to discard previous computation results. Data accumulation is particularly beneficial to manifold
learning algorithms due to their non-parametric nature. Another reason for developing incremental
(non-batch) methods is that the gradual changes in the data manifold can be visualized. As more and
more data points are obtained, the evolution of the data manifold can reveal interesting properties
of the data stream. Incremental learning can also help us to decide when we should stop collecting
data: if there is no noticeable change in the learning result with the additional data collected, there
is no point in continuing. The intermediate result produced by an incremental algorithm can alert
us to the existence of any “problematic” region, so that we can focus the remaining data collection
effort on that region. An incremental algorithm can be easily modified to incorporate “forgetting”, i.e.,
the old data points gradually lose their significance. The algorithm can then adjust the manifold in
the presence of drift in the data characteristics. Incremental learning is also useful when there
is an unbounded stream of possible data to learn from. This situation can arise when a continuous
invariance transformation is applied to a finite set of training data to create additional data to reflect
pattern invariance.

In this chapter, we describe a modification of the ISOMAP algorithm so that it can update the
low dimensional representation of data points efficiently as additional samples become available.
Both the original ISOMAP algorithm [248] and its landmark points version [55] are considered.
We are interested in ISOMAP because it is intuitive, well understood, and produces good mapping
results [133, 276]. Furthermore, there are theoretical studies supporting the use of ISOMAP, such
as its convergence proof [18] and the conditions for successful recovery of co-ordinates [66]. There is
also a continuum extension of ISOMAP [282] as well as a spatio-temporal extension [133]. However,
the motivation of our work is applicable to other mapping algorithms as well.

^1 Sammon’s mapping can be implemented by a feed-forward neural network [180] and hence can be
made online if an online training rule is used.

The  main  contributions  of  this  chapter  include: 

1.  An  algorithm  that  efficiently  updates  the  solution  of  the  all-pairs  shortest  path  problems. 
This  contrasts  with  previous  work  like  [193],  where  different  shortest  path  trees  are  updated 
independently. 

2.  More  accurate  mappings  for  new  points  by  a  superior  estimate  of  the  inner  products. 

3. A solution to the incremental eigen-decomposition problem with increasing matrix size, based on
subspace iteration with Ritz acceleration. This differs from previous work [270], where the matrix
size is assumed to be constant.

4.  A  vertex  contraction  procedure  that  improves  the  geodesic  distance  estimate  without  additional 
memory. 

The  rest  of  this  chapter  is  organized  as  follows.  After  a  recap  of  ISOMAP  in  section  3.1,  the 
proposed  incremental  methods  are  described  in  section  3.2.  Experimental  results  are  presented  in 
section  3.3,  followed  by  discussions  in  section  3.4.  Finally,  in  section  3.5  we  conclude  and  describe 
some  topics  for  future  work. 


3.1  Details  of  ISOMAP 

The basic idea of the ISOMAP algorithm was presented in Section 2.6. It maps a high dimensional
data set $y_1, \ldots, y_n$ in $\mathbb{R}^D$ to its low dimensional counterpart $x_1, \ldots, x_n$ in
$\mathbb{R}^d$, in such a way that the geodesic distance between $y_i$ and $y_j$ on the data manifold is
as close to the Euclidean distance between $x_i$ and $x_j$ in $\mathbb{R}^d$ as possible. In this section,
we provide more algorithmic details on how the mapping is done. This also defines the notation that we
are going to use throughout this chapter.

The ISOMAP algorithm has three stages. First, a neighborhood graph is constructed. Let $\Delta_{ij}$ be
the (Euclidean) distance between $y_i$ and $y_j$. A weighted undirected neighborhood graph
$\mathcal{G} = (V, E)$ with the vertex $v_i \in V$ corresponding to $y_i$ is constructed. An edge $e(i,j)$
between $v_i$ and $v_j$ exists if and only if $y_i$ is a neighbor of $y_j$, i.e., $y_i \in N(y_j)$. The weight
of $e(i,j)$, denoted by $w_{ij}$, is set to $\Delta_{ij}$. The set of indices of the vertices adjacent to
$v_i$ in $\mathcal{G}$ is denoted by $adj(i)$.

ISOMAP proceeds with the estimation of geodesic distances. Let $g_{ij}$ denote the length of the
shortest path $sp(i,j)$ between $v_i$ and $v_j$. The shortest paths are found by Dijkstra’s algorithm
with different source vertices. The shortest paths can be stored efficiently by the predecessor matrix
$\pi_{ij}$, where $\pi_{ij} = k$ if $v_k$ is immediately before $v_j$ in $sp(i,j)$. If there is no path from
$v_i$ to $v_j$, $\pi_{ij}$ is set to 0. Conceptually, however, it is useful to imagine a shortest path tree
$T(i)$, where the root node is $v_i$ and $sp(i,j)$ consists of the tree edges from $v_i$ to $v_j$. The subtree
of $T(i)$ rooted at $v_a$ is denoted by $T(i; a)$. Since $g_{ij}$ is the approximate geodesic distance
between $y_i$ and $y_j$, we shall call $g_{ij}$ the “geodesic distance”. Note that $G = \{g_{ij}\}$ is a
symmetric matrix.
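
The first two stages can be illustrated with a minimal Python sketch. It is not the implementation used in this thesis: the function names are hypothetical, indices are 0-based, the neighborhood graph is assumed to be the symmetric union of the knn relations, and `None` (rather than 0) marks a missing predecessor.

```python
import heapq
import math

def knn_graph(points, k):
    """Symmetric knn graph: edge (i, j) exists if j is among the k nearest
    neighbors of i or vice versa; the weight is the Euclidean distance."""
    n = len(points)
    adj = [dict() for _ in range(n)]
    for i in range(n):
        dists = sorted((math.dist(points[i], points[j]), j)
                       for j in range(n) if j != i)
        for d, j in dists[:k]:
            adj[i][j] = d
            adj[j][i] = d
    return adj

def dijkstra(adj, source):
    """Return (g, pred): g[j] is the shortest-path length from source to j and
    pred[j] is the vertex immediately before j on that path (None for the
    source itself and for unreachable vertices)."""
    n = len(adj)
    g = [math.inf] * n
    pred = [None] * n
    g[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        d, t = heapq.heappop(heap)
        if d > g[t]:
            continue  # stale heap entry
        for j, w in adj[t].items():
            if d + w < g[j]:
                g[j] = d + w
                pred[j] = t
                heapq.heappush(heap, (g[j], j))
    return g, pred

def all_pairs(adj):
    """Run Dijkstra from every vertex to obtain G = {g_ij} and {pi_ij}."""
    results = [dijkstra(adj, s) for s in range(len(adj))]
    G = [g for g, _ in results]
    P = [p for _, p in results]
    return G, P
```

Here `P[i][j]` plays the role of $\pi_{ij}$, so the shortest path tree $T(i)$ can be read off row `i` of `P` without storing the trees explicitly.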

Finally, ISOMAP recovers $x_i$ by using the classical scaling [49] on the geodesic distance. Define
$X = [x_1, \ldots, x_n]$. Compute $B = -\frac{1}{2} H \tilde{G} H$, where $H = \{h_{ij}\}$,
$h_{ij} = \delta_{ij} - 1/n$ and $\delta_{ij}$ is the delta function, i.e., $\delta_{ij} = 1$ if $i = j$ and 0
otherwise. The entries $\tilde{g}_{ij}$ of $\tilde{G}$ are simply $g_{ij}^2$. We seek $X^T X$ to be as close
to $B$ as possible in the least square sense. This is done by setting
$X = [\sqrt{\lambda_1} v_1 \ldots \sqrt{\lambda_d} v_d]^T$, where $\lambda_1, \ldots, \lambda_d$ are the
$d$ largest eigenvalues of $B$, with corresponding eigenvectors $v_1, \ldots, v_d$.
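
The classical scaling stage can be sketched in a few lines of NumPy. This is an illustration under the notation above, not the code used in the experiments; the function name `classical_scaling` is hypothetical, and clipping slightly negative eigenvalues to zero is our own safeguard.

```python
import numpy as np

def classical_scaling(G, d):
    """Embed n points in R^d from an n-by-n matrix G of (geodesic) distances.
    Returns X of shape (d, n), one column per point, as in the text."""
    n = G.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n      # centering matrix, h_ij = delta_ij - 1/n
    B = -0.5 * H @ (G ** 2) @ H              # target inner product matrix
    eigvals, eigvecs = np.linalg.eigh(B)     # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:d]      # indices of the d largest eigenvalues
    lam = np.clip(eigvals[top], 0.0, None)   # guard against tiny negative values
    return np.sqrt(lam)[:, None] * eigvecs[:, top].T
```

When `G` holds exact Euclidean distances of points in $\mathbb{R}^d$, the pairwise distances between the columns of the returned `X` reproduce `G` up to rounding, and the columns sum to zero.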

3.2 Incremental Version of ISOMAP


The key computation in ISOMAP involves solving an all-pairs shortest path problem and an
eigen-decomposition problem. As new data arrive, these quantities usually do not change much: a new
vertex in the graph often changes the shortest paths among only a subset of the vertices, and the
simple eigenvectors and eigenvalues of a slightly perturbed real symmetric matrix stay close to their
original values. This justifies the reuse of the current geodesic distance and co-ordinate estimates
for update. We restrict our attention to the knn neighborhood, since the $\epsilon$-neighborhood is awkward
for incremental learning: the neighborhood size should be constantly decreasing as additional data
points become available.

The problem of incremental ISOMAP can be stated as follows. Assume that the low dimensional
co-ordinates $x_i$ of $y_i$ for the first $n$ points are given. We observe the new sample $y_{n+1}$. How
should we update the existing set of $x_i$ and find $x_{n+1}$? Our solution consists of three stages. The
geodesic distances $g_{ij}$ are first updated in view of the change of neighborhood graph due to $v_{n+1}$.
The geodesic distances of the new point to the existing points are then used to estimate $x_{n+1}$. Finally,
all $x_i$ are updated in view of the change in $g_{ij}$.

In  section  3.2.1,  we  shall  describe  the  modification  of  the  original  ISOMAP  for  incremental 
updates.  A  variant  of  ISOMAP  that  utilizes  the  geodesic  distances  from  a  fixed  set  of  points 
(landmark  points)  [55]  is  modified  to  become  incremental  in  section  3.2.2.  Because  ISOMAP  is 
non-parametric,  the  data  points  themselves  need  to  be  stored.  Section  3.2.3  describes  a  vertex 
contraction  procedure,  which  improves  the  geodesic  distance  estimate  with  the  arrival  of  new  data 
without  storing  the  new  data.  This  procedure  can  be  applied  to  both  the  variants  of  ISOMAP. 
Throughout this section we assume d (dimensionality of the projected space) is fixed. The value of d
can be estimated by analyzing either the spectrum of the target inner product matrix or the residue of the
low rank approximation as in [248], or by other methods to estimate the intrinsic dimensionality of
a manifold [143, 171, 47, 35, 33, 259, 207].


3.2.1  Incremental  ISOMAP:  Basic  Version 

We shall modify the original ISOMAP algorithm [248] (summarized in section 3.1) to become
incremental. Details of the algorithms as well as an analysis of their time complexity are given in
Appendix A. Throughout this section, the shortest paths are represented by the more economical
predecessor matrix, instead of multiple shortest path trees $T(i)$.


3.2.1.1 Updating the Neighborhood Graph

Let $\mathcal{A}$ and $\mathcal{D}$ denote the set of edges to be added and deleted after inserting
$v_{n+1}$ to the neighborhood graph, respectively. An edge $e(i, n+1)$ should be added if (i) $v_i$ is one
of the k nearest neighbors of $v_{n+1}$, or (ii) $v_{n+1}$ replaces an existing vertex and becomes one of
the k nearest neighbors of $v_i$. In other words,

$$\mathcal{A} = \{ e(i, n+1) : \Delta_{n+1,i} \le \Delta_{n+1,\tau_{n+1}} \text{ or } \Delta_{i,n+1} < \Delta_{i,\tau_i} \}, \tag{3.1}$$

where $\tau_i$ is the index of the k-th nearest neighbor of $v_i$.

For $\mathcal{D}$, note that a necessary condition to delete the edge $e(i,j)$ is that $v_{n+1}$ replaces
$v_i$ ($v_j$) as one of the k nearest neighbors of $v_j$ ($v_i$). So, all the edges to be deleted must be in
the form $e(i, \tau_i)$ with $\Delta_{i,n+1} < \Delta_{i,\tau_i}$. The deletion should proceed if $v_i$ is not
one of the k nearest neighbors of $v_{\tau_i}$ after inserting $v_{n+1}$. Therefore,

$$\mathcal{D} = \{ e(i, \tau_i) : \Delta_{i,\tau_i} > \Delta_{i,n+1} \text{ and } \Delta_{\tau_i,i} > \Delta_{\tau_i,\iota_i} \}, \tag{3.2}$$

 1: Input: $e(a,b)$, the edge to be removed; $\{g_{ij}\}$; $\{\pi_{ij}\}$
 2: Output: $F_{(a,b)}$, set of “affected” vertex pairs
 3: $R_{ab} := \emptyset$; Q.enqueue(a);
 4: while Q.notEmpty do
 5:   t := Q.pop; $R_{ab} := R_{ab} \cup \{t\}$;
 6:   for all $u \in adj(t)$ do
 7:     If $\pi_{ub} = a$, enqueue u to Q;
 8:   end for
 9: end while {Construction of $R_{ab}$ finishes when the loop ends.}
10: $F_{(a,b)} := \emptyset$;
11: Initialize $T'$, the expanded part of $T(a; b)$, to contain $v_b$ only;
12: for all $u \in R_{ab}$ do
13:   Q.enqueue(b);
14:   while Q.notEmpty do
15:     t := Q.pop;
16:     if $\pi_{at} = \pi_{ut}$ then
17:       $F_{(a,b)} := F_{(a,b)} \cup \{(u, t)\}$;
18:       if $v_t$ is a leaf node in $T'$ then
19:         for all $v_s$ in $adj(t)$ do
20:           Insert $v_s$ as a child of $v_t$ in $T'$ if $\pi_{as} = t$;
21:         end for
22:       end if
23:       Insert all the children of $v_t$ in $T'$ to the queue Q;
24:     end if
25:   end while
26: end for {$\forall u \in R_{ab}$, $\forall s \in T(u; b)$, $sp(u,s)$ uses $e(a,b)$.}

Algorithm 3.1: ConstructFab: $F_{(a,b)}$, the set of vertex pairs whose shortest paths are invalidated
when $e(a,b)$ is deleted, is constructed. $R_{ab}$ is the set of vertices such that if $u \in R_{ab}$, the
shortest path between $b$ and $u$ contains $e(a,b)$.

where $\iota_i$ is the index of the k-th nearest neighbor of $v_{\tau_i}$ after inserting $v_{n+1}$ in the
graph. Note that we have assumed there is no tie in the distances. If there are ties, random perturbation
can be applied to break the ties.
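
The two edge sets can be computed directly from their definitions. The following Python sketch is a brute-force illustration of Equations (3.1) and (3.2), not the procedure analyzed in Appendix A; it assumes 0-based indices (the new vertex has index n), no ties in the distances, and more than k existing points. The names `knn_lists` and `edge_updates` are hypothetical.

```python
import math

def knn_lists(dist, k, members):
    """For each vertex in `members`, its k nearest other members, closest first."""
    return {i: sorted((j for j in members if j != i), key=lambda j: dist[i][j])[:k]
            for i in members}

def edge_updates(points, new_point, k):
    """Return (A, D): edges to add and to delete when new_point is inserted.
    Vertices 0..n-1 are the existing points; vertex n is the new one."""
    pts = list(points) + [new_point]
    n = len(points)
    dist = [[math.dist(p, q) for q in pts] for p in pts]
    before = knn_lists(dist, k, range(n))        # neighbor lists before insertion
    after = knn_lists(dist, k, range(n + 1))     # neighbor lists after insertion
    A, D = set(), set()
    for i in range(n):
        tau = before[i][-1]                      # k-th nearest neighbor of v_i
        # Condition (3.1): v_i is a knn of the new vertex, or the new vertex
        # becomes a knn of v_i.
        if i in after[n] or dist[i][n] < dist[i][tau]:
            A.add((i, n))
        # Condition (3.2): the new vertex displaces tau from the list of v_i,
        # and v_i is not among the k nearest neighbors of tau afterwards.
        iota = after[tau][-1]
        if dist[i][tau] > dist[i][n] and dist[tau][i] > dist[tau][iota]:
            D.add((min(i, tau), max(i, tau)))
    return A, D
```

Recomputing every neighbor list costs far more than an incremental update would; the sketch is only meant to make the two conditions concrete.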

3.2.1.2 Updating the Geodesic Distances

The  deleted  edges  can  break  existing  shortest  paths,  while  the  added  edges  can  create  improved 
shortest  paths.  This  is  much  more  involved  than  it  appears,  because  the  change  of  a  single  edge  can 
modify  the  shortest  paths  among  multiple  vertices. 

Consider $e(a,b) \in \mathcal{D}$. If $sp(a,b)$ is not simply $e(a,b)$, deletion of $e(a,b)$ has no effect on
the geodesic distances. Hence, we shall suppose that $sp(a,b)$ consists of the single edge $e(a,b)$. We
propagate the effect of the removal of $e(a,b)$ to the set of vertices $R_{ab}$ (Figure 3.1). $R_{ab}$ is
used in turn to construct $F_{(a,b)}$, the set of all $(i,j)$ pairs with $e(a,b)$ in $sp(i,j)$. This is done by
ConstructFab (Algorithm 3.1), which finds all the vertices $t$ under $T(a; b)$ such that $sp(u,t)$ contains
$v_a$, where $u \in R_{ab}$. The set of vertex pairs whose shortest paths are invalidated due to the removal
of edges in $\mathcal{D}$ is thus $F = \bigcup_{e(a,b) \in \mathcal{D}} F_{(a,b)}$. The shortest path
distances between these vertex pairs are updated by ModifiedDijkstra (Algorithm 3.2) with source vertex
$v_u$ and destination vertices $C(u)$. It is similar to Dijkstra’s algorithm, except that only the geodesic
distances from $v_u$ to $C(u)$ (instead of all the vertices) are unknown. Note that both $v_u$ and $C(u)$
are derived from $F$.
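
To make the definition of $F_{(a,b)}$ concrete, the set can also be obtained by a brute-force test on the distance matrix. The sketch below is not Algorithm 3.1: it is an $O(n^2)$ reference check that may be useful for testing, it assumes shortest paths are unique and that the comparison is exact (e.g., integer weights), and the name `affected_pairs` is hypothetical.

```python
def affected_pairs(G, a, b, w_ab):
    """Pairs (i, j), i < j, whose shortest path passes through edge e(a, b),
    found by testing g_ij == g_ia + w_ab + g_bj (or the mirrored form).
    Assumes unique shortest paths, so equality implies the edge is used."""
    n = len(G)
    F = set()
    for i in range(n):
        for j in range(i + 1, n):
            via_ab = G[i][a] + w_ab + G[b][j]   # path i ... a -> b ... j
            via_ba = G[i][b] + w_ab + G[a][j]   # path i ... b -> a ... j
            if G[i][j] == min(via_ab, via_ba):
                F.add((i, j))
    return F
```

With floating-point weights, the equality test would need a tolerance, which is one reason the tree-based propagation of Algorithm 3.1 is preferable in practice.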

The order of the source vertex in invoking ModifiedDijkstra can impact the run time significantly.
An approximately optimal order is found by interpreting $F$ as an auxiliary graph $\mathcal{B}$ (the
undirected edge $e(i,j)$ is in $\mathcal{B}$ iff $(i,j) \in F$), and removing the vertices in $\mathcal{B}$
with the smallest degree in a greedy manner (OptimalOrder, Algorithm 3.3). When $v_u$ is removed from
$\mathcal{B}$, ModifiedDijkstra is called with source vertex $v_u$ and $C(u)$ as the neighbors of $v_u$ in
$\mathcal{B}$.

 1: Input: $u$; $C(u)$; $\{g_{ij}\}$; $\{w_{ij}\}$
 2: Output: the updated geodesic distances $\{g_{uv}\}$
 3: for all $j \in C(u)$ do
 4:   $H := adj(j) \cap (V \setminus C(u))$;
 5:   $\delta(j) := \min_{k \in H} (g_{uk} + w_{kj})$, or $\infty$ if $H = \emptyset$;
 6:   Insert $\delta(j)$ to a heap with index $j$;
 7: end for
 8: while the heap is not empty do
 9:   $k$ := the index of the entry by “Extract Min” on the heap;
10:   $C(u) := C(u) \setminus \{k\}$; $g_{uk} := \delta(k)$; $g_{ku} := \delta(k)$;
11:   for all $j \in adj(k) \cap C(u)$ do
12:     dist := $g_{uk} + w_{kj}$;
13:     If dist $< \delta(j)$, perform “Decrease Key” on $\delta(j)$ to become dist;
14:   end for
15: end while

Algorithm 3.2: ModifiedDijkstra: The geodesic distances from the source vertex $u$ to the set of
vertices $C(u)$ are updated.


 1: Input: Auxiliary graph $\mathcal{B}$
 2: Output: None. The geodesic distances are updated as a side-effect
 3: $l[i]$ := an empty linked list, for $i = 1, \ldots, n$;
 4: for all $v_u \in \mathcal{B}$ do
 5:   $f$ := degree of $v_u$ in $\mathcal{B}$. Insert $v_u$ to $l[f]$;
 6: end for
 7: pos := 1;
 8: for $i$ := 1 to $n$ do
 9:   If $l[pos]$ is empty, increment pos one by one until $l[pos]$ is not empty;
10:   Remove $v_u$, a vertex in $l[pos]$, from the graph $\mathcal{B}$;
11:   Call ModifiedDijkstra($u$, $adj(u)$ in $\mathcal{B}$);
12:   for all $v_j$ that is a neighbor of $v_u$ in $\mathcal{B}$ do
13:     Find $f$ such that $v_j \in l[f]$ by an indexing array;
14:     Remove $v_j$ from $l[f]$ if $f = 1$, and move $v_j$ from $l[f]$ to $l[f-1]$ otherwise;
15:     pos := min(pos, $f - 1$);
16:   end for
17: end for

Algorithm 3.3: OptimalOrder: a greedy algorithm to remove the vertex with the smallest degree
in the auxiliary graph $\mathcal{B}$. The removal of $v_u$ corresponds to the execution of ModifiedDijkstra
(Algorithm 3.2) with $u$ as the source vertex.
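
A compact Python rendering of the idea behind Algorithm 3.2 is sketched below. It is an illustration only: `heapq` has no “Decrease Key”, so stale heap entries are skipped instead, the graph is assumed to be stored as a list of neighbor-to-weight dictionaries with the deleted edges already removed, and the name `modified_dijkstra` is hypothetical.

```python
import heapq
import math

def modified_dijkstra(u, C, G, adj):
    """Recompute g_uj for every j in C, treating all other distances from u
    as already correct. G is the symmetric distance matrix (list of lists,
    updated in place) and adj[j] maps each neighbor k of j to w_jk."""
    C = set(C)
    delta = {}
    for j in C:
        # Best route into j from a vertex whose distance to u is still valid.
        known = [G[u][k] + w for k, w in adj[j].items() if k not in C]
        delta[j] = min(known, default=math.inf)
    heap = [(d, j) for j, d in delta.items()]
    heapq.heapify(heap)
    while heap:
        d, k = heapq.heappop(heap)
        if k not in C or d > delta[k]:
            continue                      # stale entry, stands in for "Decrease Key"
        C.remove(k)
        G[u][k] = G[k][u] = d
        for j, w in adj[k].items():
            if j in C and d + w < delta[j]:
                delta[j] = d + w
                heapq.heappush(heap, (delta[j], j))
    return G
```

For example, on a path graph 0-1-2-3 with unit weights from which a shortcut edge between vertices 0 and 3 has just been deleted, calling the function with source 0 and `C = {2, 3}`, and then with source 1 and `C = {3}`, restores the path distances.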

The next stage of the algorithm finds the geodesic distances between $v_{n+1}$ and the other vertices.
Since all the edges in $\mathcal{A}$ (edges to be inserted) are incident on $v_{n+1}$, we have

$$g_{n+1,i} = g_{i,n+1} = \min_{j \text{ such that } e(n+1,j) \in \mathcal{A}} \left( g_{ij} + w_{j,n+1} \right). \tag{3.3}$$
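
Equation (3.3) translates into a single minimization per existing vertex. In the sketch below, which is only an illustration, `new_edges` is a hypothetical dictionary that maps each neighbor $j$ of the new vertex to the weight $w_{j,n+1}$, and `G` is assumed to have been updated already for the deleted edges.

```python
import math

def new_vertex_distances(G, new_edges):
    """Equation (3.3): g_{n+1,i} is the minimum of g_ij + w_{j,n+1} over the
    edges e(n+1, j) in A. G is the n-by-n matrix of current geodesic distances;
    new_edges maps each neighbor j of the new vertex to w_{j,n+1}."""
    n = len(G)
    return [min((G[i][j] + w for j, w in new_edges.items()), default=math.inf)
            for i in range(n)]
```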

Finally, we consider how $\mathcal{A}$ can shorten other geodesic distances. This is done by first locating
all the vertex pairs $(v_a, v_b)$, both adjacent to $v_{n+1}$, such that $v_a \to v_{n+1} \to v_b$ is a better
shortest path between $v_a$ and $v_b$. Starting from $v_a$ and $v_b$, UpdateInsert (Algorithm 3.4) searches
for all the vertex pairs that can use the new edge for a better shortest path, based on the updated graph.

For all the priority queues in this section, a binary heap is used instead of the asymptotically faster
Fibonacci heap. Since the size of our heap is typically small, a binary heap, with a smaller time
constant, is likely to be more efficient.

[Figure 3.1 panels; surviving panel label: (a) An example of neighborhood graph]

Figure 3.1: The edge $e(a,b)$ is to be deleted from the neighborhood graph shown in (a). The shortest
path tree $T(a)$ is shown as directed arrows in (b). $R_{ab}$ (c.f. Algorithm 3.1) consists of all the vertices
$v_u$ such that $sp(b,u)$ contains $e(a,b)$, i.e., $\pi_{ub} = a$.

Figure 3.2: Effect of edge insertion. $T(a)$ before the insertion of $v_{n+1}$ is represented by the arrows
between vertices. The introduction of $v_{n+1}$ creates a better path between $v_a$ and $v_b$. $S$ denotes the
set of vertices such that $t \in S$ iff $sp(b,t)$ is improved by $v_{n+1}$. Note that $v_t$ must be in
$T(n+1; a)$. For each $u \in S$, UpdateInsert (Algorithm 3.4) finds $t$ such that $sp(u,t)$ is improved by
$v_{n+1}$, starting with $t = b$.


3.2.1.3 Finding the Co-ordinates of the New Sample

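The co-ordinate $x_{n+1}$ is found by matching its inner product with $x_i$ to the values derived from the
geodesic distances. This approach is in the same spirit as the classical scaling [49] used in ISOMAP.
Define $\gamma_{ij} = \|x_i - x_j\|^2 = \|x_i\|^2 + \|x_j\|^2 - 2 x_i^T x_j$. Since $\sum_{i=1}^{n} x_i = 0$,
summation over $j$ and then over $i$ for $\gamma_{ij}$ leads to

$$\|x_i\|^2 = \frac{1}{n}\Big(\sum_j \gamma_{ij} - \sum_j \|x_j\|^2\Big), \qquad
\sum_j \|x_j\|^2 = \frac{1}{2n}\sum_{ij} \gamma_{ij}.$$

Similarly, if we define $\gamma_i = \|x_i - x_{n+1}\|^2$, we have

$$\|x_{n+1}\|^2 = \frac{1}{n}\Big(\sum_{i=1}^{n} \gamma_i - \sum_{i=1}^{n} \|x_i\|^2\Big), \qquad
x_{n+1}^T x_i = -\frac{1}{2}\big(\gamma_i - \|x_{n+1}\|^2 - \|x_i\|^2\big).$$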

 1: Input: $a$; $b$; $\{g_{ij}\}$; $\{w_{ij}\}$
 2: Output: $\{g_{ij}\}$ are updated because of the new shortest path $v_a \to v_{n+1} \to v_b$.
 3: $S := \emptyset$; Q.enqueue(a);
 4: while Q.notEmpty do
 5:   t := Q.pop; $S := S \cup \{t\}$;
 6:   for all $v_u$ that are children of $v_t$ in $T(n+1)$ do
 7:     if $g_{u,n+1} + w_{n+1,b} < g_{u,b}$ then
 8:       Q.enqueue(u);
 9:     end if
10:   end for
11: end while {$S$ has been constructed.}
12: for all $u \in S$ do
13:   Q.enqueue(b);
14:   while Q.notEmpty do
15:     t := Q.pop; $g_{ut} := g_{tu} := g_{u,n+1} + g_{n+1,t}$;
16:     for all $v_s$ that are children of $v_t$ in $T(n+1)$ do
17:       if $g_{s,n+1} + w_{n+1,a} < g_{s,a}$ then
18:         Q.enqueue(s);
19:       end if
20:     end for
21:   end while
22: end for {$\forall u \in S$, update $sp(u,t)$ if $v_{n+1}$ helps.}

Algorithm 3.4: UpdateInsert: given that $v_a \to v_{n+1} \to v_b$ is a better shortest path between $v_a$
and $v_b$ after the insertion of $v_{n+1}$, its effect is propagated to other vertices.


If we approximate $\gamma_{ij}$ by $g_{ij}^2$ and $\gamma_i$ by $g_{i,n+1}^2$, the target inner product
$f_i$ between $x_{n+1}$ and $x_i$ can be estimated by

$$2 f_i \approx \frac{\sum_j g_{ij}^2}{n} - \frac{\sum_{lj} g_{lj}^2}{n^2} + \frac{\sum_l g_{l,n+1}^2}{n} - g_{i,n+1}^2. \tag{3.4}$$

$x_{n+1}$ is obtained by solving $X^T x_{n+1} = f$ in the least-square sense, where
$f = (f_1, \ldots, f_n)^T$. One way to interpret the least square solution is by noting that
$X = (\sqrt{\lambda_1} v_1, \ldots, \sqrt{\lambda_d} v_d)^T$, where $(\lambda_i, v_i)$ is an eigenpair of the
target inner product matrix. The least square solution can be written as

$$x_{n+1} = \Big( \frac{1}{\sqrt{\lambda_1}} v_1^T f, \ldots, \frac{1}{\sqrt{\lambda_d}} v_d^T f \Big)^T. \tag{3.5}$$

The same estimate is obtained if the Nyström approximation [89] is used.
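
Equations (3.4) and (3.5) can be checked numerically with a short NumPy sketch. This is an illustration rather than the thesis implementation: the function name `embed_new_point` is hypothetical, and a generic least-squares solver is used, which coincides with (3.5) when the rows of $X$ are $\sqrt{\lambda_i} v_i^T$.

```python
import numpy as np

def embed_new_point(X, G, g_new):
    """X: (d, n) current co-ordinates with columns summing to zero.
    G: (n, n) geodesic distances among the existing points.
    g_new: (n,) geodesic distances from the new point to the existing points.
    Returns the d-vector x_{n+1} following Equations (3.4) and (3.5)."""
    n = G.shape[0]
    G2 = G ** 2
    g2 = g_new ** 2
    # Equation (3.4): twice the target inner product between x_{n+1} and x_i.
    two_f = G2.sum(axis=1) / n - G2.sum() / n**2 + g2.sum() / n - g2
    f = 0.5 * two_f
    # Least-squares solution of X^T x = f.
    x_new, *_ = np.linalg.lstsq(X.T, f, rcond=None)
    return x_new
```

If the distances are exact Euclidean distances of centered points in $\mathbb{R}^d$, the estimate recovers the new point exactly; with geodesic distances it is only an approximation.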

A similar procedure is used to compute the out-of-sample extension of ISOMAP in [55, 17].
However, there is an important difference: in these studies, the inner product between the new
sample and the existing points is estimated by

$$2 \tilde{f}_i = \frac{1}{n} \sum_{j=1}^{n} g_{ij}^2 - g_{i,n+1}^2. \tag{3.6}$$

It is unclear how this estimate is derived. This estimate is different from that in Equation (3.4)
because $\sum_l g_{l,n+1}^2 / n - \sum_{lj} g_{lj}^2 / n^2$ does not vanish in general; in fact, most of the
time this is a large number. Empirical comparisons indicate that our inner product estimate given in
Equation (3.4) is much more accurate than the one in Equation (3.6).
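
The size of the discrepancy can be made explicit with the identities of this section. The following short derivation is our own worked step, under the idealization that the geodesic distances equal the Euclidean distances in the embedded space (i.e., $g_{ij}^2 = \gamma_{ij}$ and $g_{i,n+1}^2 = \gamma_i$):

```latex
\begin{align*}
2 f_i - 2 \tilde{f}_i
  &= \frac{1}{n}\sum_{l} g_{l,n+1}^2 - \frac{1}{n^2}\sum_{l,j} g_{lj}^2
   = \frac{1}{n}\sum_{l} \gamma_l - \frac{1}{n^2}\sum_{l,j} \gamma_{lj} \\
  &= \Big(\|x_{n+1}\|^2 + \frac{1}{n}\sum_j \|x_j\|^2\Big) - \frac{2}{n}\sum_j \|x_j\|^2
   = \|x_{n+1}\|^2 - \frac{1}{n}\sum_j \|x_j\|^2 .
\end{align*}
```

Under this idealization the two estimates agree only when the squared norm of the new point equals the average squared norm of the existing points, and the offset is the same for every $i$.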

Finally, the new mean is subtracted from $x_i$, $i = 1, \ldots, (n+1)$, to ensure
$\sum_{i=1}^{n+1} x_i = 0$, in order to conform to the convention in the standard ISOMAP.

3.2.1.4 Updating the Co-ordinates

The co-ordinates $x_i$ should be updated in view of the modified geodesic distance matrix $G_{new}$. This
can be viewed as an incremental eigenvalue problem, as $x_i$ can be obtained by eigen-decomposition.
However, since the size of the geodesic distance matrix is increasing, traditional methods (such as
those described in [270] or [30]) cannot be applied directly. We update $X$ by finding the eigenvalues
and eigenvectors of $B_{new}$ by an iterative scheme. Note that gradient descent can be used instead
[168].

A good initial guess for the subspace of dominant eigenvectors of $B_{new}$ is the column space
of $X^T$. Subspace iteration together with Rayleigh-Ritz acceleration [96] is used to find a better
eigen-space:

1. Compute $Z = B_{new} V$ and perform QR decomposition on $Z$, i.e., we write $Z = QR$ and let
$V = Q$.

2. Form $Z = V^T B_{new} V$ and perform eigen-decomposition of the d by d matrix $Z$. Let $\lambda_i$ and
$u_i$ be the i-th eigenvalue and the corresponding eigenvector.

3. $V_{new} = V [u_1 \ldots u_d]$ is the improved set of eigenvectors of $B_{new}$.

Since d is small, the time for eigen-decomposition of $Z$ is negligible. We do not use any variant
of inverse iteration because $B_{new}$ is not sparse and its inversion takes $O(n^3)$ time.
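
The three steps above can be sketched in NumPy as follows. This is an illustration under our own assumptions: the function names are hypothetical, the fixed iteration count `n_iter` is arbitrary (a convergence test would normally be used), and the dominant eigenvalues are assumed to be the largest positive ones.

```python
import numpy as np

def subspace_iteration_step(B, V):
    """One pass of the three steps above. B: (n, n) symmetric; V: (n, d)
    initial guess for the dominant eigenvectors. Returns (eigenvalues, V_new)."""
    Q, _ = np.linalg.qr(B @ V)            # step 1: Z = B V, Z = QR, V = Q
    Z = Q.T @ B @ Q                       # step 2: d-by-d projected matrix
    lam, U = np.linalg.eigh(Z)
    order = np.argsort(lam)[::-1]         # largest eigenvalue first
    return lam[order], Q @ U[:, order]    # step 3: V_new = V [u_1 ... u_d]

def refine(B, V, n_iter=50):
    """Repeat the step a fixed number of times; n_iter is an arbitrary choice."""
    lam = None
    for _ in range(n_iter):
        lam, V = subspace_iteration_step(B, V)
    return lam, V
```

In the incremental setting, `V` would be initialized from the current co-ordinates, so only a few passes should be needed when $B_{new}$ differs little from the previous inner product matrix. The dominant cost per pass is the product `B @ V`.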

3.2.1.5 Complexity

In Appendix A.4, we show that the overall complexity of the geodesic distance update can be
written as $O(q(|F| + |H|) + \mu \nu \log \nu + |\mathcal{A}|^2)$, where $F$ and $H$ contain vertex pairs
whose geodesic distances are lengthened and shortened because of $v_{n+1}$, respectively, $q$ is the maximum
degree of the vertices in the graph, $\mu$ is the number of vertices with non-zero degree in $\mathcal{B}$,
and $\nu = \max_i \nu_i$. Here, $\nu_i$ is the degree of the i-th vertex removed from the auxiliary graph
$\mathcal{B}$ in Algorithm 3.3. We conjecture that $\nu$, on average, is of the order $O(\log \mu)$. Note that
$\mu \le 2|F|$. The complexity is thus $O(q(|F| + |H|) + \mu \log \mu \log\log \mu + |\mathcal{A}|^2)$. In
practice, the first two terms dominate, leading to the effective complexity $O(q(|F| + |H|))$.

We also want to point out that Algorithm 3.2 is fairly efficient; its complexity to solve the all-pairs
shortest path by updating all geodesic distances is $O(n^2 \log n + n^2 q)$. This is the same as the
complexity of the best known algorithm for the all-pairs shortest path problem of a sparse graph,
which involves running Dijkstra’s algorithm multiple times with different source vertices. For the
update of co-ordinates, subspace iteration takes $O(n^2)$ time because of the matrix multiplication.

3.2.2  ISOMAP  With  Landmark  Points 

One drawback of the original ISOMAP is its quadratic memory requirement: the geodesic distance
matrix is dense and is of size $O(n^2)$, making ISOMAP infeasible for large data sets. Landmark
ISOMAP was proposed in [55] to reduce the memory requirement while lowering the computation
cost. Instead of all the pairwise geodesic distances, landmark ISOMAP finds a mapping that preserves
the geodesic distances originating from a small set of “landmark points”. This idea is not
entirely new, and the authors in [25] refer to it as the “reference point approach” in the context of
embedding.

Without loss of generality, let the first $m$ points, i.e., $y_1, \ldots, y_m$, be the landmark points.
After constructing the neighborhood graph as in the original ISOMAP, landmark ISOMAP uses

    $\mathcal{J} := \emptyset$;
    for all $(r_i, s_i, w_i^{old}, w_i^{new})$ in the input do
        Swap $r_i$ and $s_i$ if $v_{r_i}$ is a child of $v_{s_i}$ in $T(a)$;
        if $v_{s_i}$ is a child of $v_{r_i}$ in $T(a)$ then
            $J := \{v_{s_i}\} \cup$ descendants of $v_{s_i}$ in $T(a)$;
            $g_{aj} = g_{aj} + w_i^{new} - w_i^{old}$ $\forall j \in J$;
            $\mathcal{J} = \mathcal{J} \cup J$;
        end if
    end for
    for all $j \in \mathcal{J}$ do
        $b := \min_{k \in adj(j)} g_{ak} + w_{kj}$; {Find a new path to $v_j$}
        Q.enqueue($j$, $\arg\min_{k \in adj(j)} g_{ak} + w_{kj}$, $b$) if $b < g_{aj}$
    end for

Algorithm 3.5: InitializeEdgeWeightIncrease for the shortest path tree from $v_a$, $T(a)$. The inputs
are the four-tuples $(r_i, s_i, w_i^{old}, w_i^{new})$, meaning the weight of $e(r_i, s_i)$ should increase from
$w_i^{old}$ to $w_i^{new}$. Q is the queue of vertices to be processed in Algorithm 3.7.

Dijkstra's algorithm to compute the $m \times n$ landmark geodesic distance matrix $G = \{g_{ij}\}$,
where $g_{ij}$ is the length of the shortest path between $v_i$ (a landmark point) and $v_j$. In [55] the
authors suggest that $X$ can be found by first embedding the landmark points and then embedding
the remaining points with respect to the landmark points. This is similar to the modification of
Sammon's mapping made by Biswas et al. in [25] to cope with large data sets. However,
our preliminary experiments indicate that this is not very robust, particularly when the number
of landmark points is small. Instead, we follow the implementation of landmark ISOMAP (footnote 2) and
decompose $B = H_m G H_n$ by singular value decomposition, $B = USV^T = (U S^{1/2})(V S^{1/2})^T$,
where $U^T U$ and $V^T V$ are identity matrices of corresponding sizes, and $S$ is a diagonal matrix of
singular values. The vectors corresponding to the largest $d$ singular values are used to construct a
low-rank approximation of $B$, from which $X$ is obtained.

3.2.2.1 Incremental Landmark ISOMAP

After updating the neighborhood graph, the incremental version for landmark ISOMAP proceeds
with the update of geodesic distances. Since only the shortest paths from a small number of source
vertices are maintained, the computation that can be shared among different shortest path trees
is limited. Therefore, we update the shortest path trees independently by adopting the algorithm
presented in [193], instead of the algorithm in section 3.2.1.2. First, Algorithm 3.5 is called to
initialize the edge weight increase, which includes edge deletion as a special case. Algorithm 3.7 is
then executed to rebuild the shortest path tree. Algorithm 3.6 is then called to initialize the edge
weight decrease, which includes edge insertion as a special case. Algorithm 3.7 is again called to
rebuild the tree. Deletion of edges is done before the addition of edges because this is more efficient
in practice.
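
The rebuild step (Algorithm 3.7) is essentially a Dijkstra-style relaxation driven by a priority queue. The sketch below illustrates it for one shortest path tree after an edge insertion. It is a simplified illustration, not the thesis code: the dictionary-based graph, the name `rebuild_tree`, and the one-entry seeding of the queue (in place of the full initialization of Algorithm 3.6, which first shifts a whole subtree) are assumptions made here.

```python
import heapq

def rebuild_tree(adj, dist, parent, queue):
    """Process queue entries (d, i, j): 'reach vertex i through j at distance d'.

    adj    : dict mapping vertex -> {neighbor: edge weight}
    dist   : dict of current distances from the source of the tree
    parent : dict of current parents in the shortest path tree
    """
    heapq.heapify(queue)
    while queue:
        d, i, j = heapq.heappop(queue)      # "Extract Min" on Q
        if d < dist[i]:                     # diff = d - g_ai < 0
            parent[i] = j                   # move v_i to be a child of v_j
            dist[i] = d
            for k, w_ik in adj[i].items():
                new_d = d + w_ik
                if new_d < dist[k]:
                    heapq.heappush(queue, (new_d, k, i))
    return dist, parent

# Toy example: path graph 0-1-2-3-4 with unit weights, source vertex 0.
adj = {v: {} for v in range(5)}
for u in range(4):
    adj[u][u + 1] = adj[u + 1][u] = 1.0
dist = {v: float(v) for v in range(5)}
parent = {0: None, 1: 0, 2: 1, 3: 2, 4: 3}

# Insert a new edge (0, 3) of weight 1.0 and seed the queue with the
# single candidate it creates (simplified seeding, see the note above).
adj[0][3] = adj[3][0] = 1.0
queue = [(dist[0] + 1.0, 3, 0)]
dist, parent = rebuild_tree(adj, dist, parent, queue)
```

Only vertices 3 and 4 are touched in this example, which reflects why the update cost depends on the number of affected nodes rather than on the size of the graph.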

The co-ordinate of the new point $x_{n+1}$ is determined by solving a least-square problem similar
to that in section 3.2.1.3. The difference is that the columns of $Q$, instead of $X$, are used. So,
$Q x_{n+1} = f$ is solved in the least-square sense. Finally, we use subspace iteration together with
Ritz acceleration [236] to improve singular vector estimates. The steps are

1. Perform SVD on the matrix $BX$: $U_1 S_1 V_1^T = BX$.


2 We are referring to the “official” implementation by the authors of ISOMAP at http://isomap.stanford.edu.

    $\mathcal{J} := \emptyset$;
    for all $(r_i, s_i, w_i^{old}, w_i^{new})$ in the input do
        Swap $r_i$ and $s_i$ if $g_{a,r_i} > g_{a,s_i}$;
        diff := $g_{a,r_i} + w_i^{new} - g_{a,s_i}$;
        if diff < 0 then
            Move $v_{s_i}$ to be a child of $v_{r_i}$ in $T(a)$;
            $J := \{v_{s_i}\} \cup$ descendants of $v_{s_i}$ in $T(a)$;
            $g_{aj} = g_{aj} +$ diff $\forall j \in J$;
            $\mathcal{J} = \mathcal{J} \cup J$;
        end if
    end for
    for all $j \in \mathcal{J}$ do
        for all $k \in adj(j)$ do
            Q.enqueue($k$, $j$, $g_{aj} + w_{jk}$) if $g_{aj} + w_{jk} < g_{ak}$
        end for
    end for

Algorithm 3.6: InitializeEdgeWeightDecrease for the shortest path tree from $v_a$, $T(a)$. The inputs
are the four-tuples $(r_i, s_i, w_i^{old}, w_i^{new})$, meaning the weight of $e(r_i, s_i)$ should decrease from
$w_i^{old}$ to $w_i^{new}$. Q is the queue of vertices to be processed in Algorithm 3.7.


    while Q.notEmpty do
        $(i, j, d)$ := “Extract Min” on Q;
        diff = $d - g_{ai}$;
        if diff < 0 then
            Move $v_i$ to be a child of $v_j$ in $T(a)$;
            $g_{ai} = d$;
            for all $k \in adj(i)$ do
                newd = $g_{ai} + w_{ik}$;
                Q.enqueue($k$, $i$, newd) if newd < $g_{ak}$;
            end for
        end if
    end while

Algorithm 3.7: Rebuild $T(a)$ for those vertices in the priority queue Q that need to be updated.

2. Perform SVD on the matrix $B^T U_1$: $U_2 S_2 V_2^T = B^T U_1$.

3. Set $X_{new} = U_2 (S_2)^{1/2}$ and $Q_{new} = U_1 (S_2)^{1/2}$.

As far as time complexity is concerned, the time to update one shortest path tree is
$O(\delta_d \log \delta_d + q \delta_d)$, where $\delta_d$ is the minimum number of nodes that must change their distance
or parent attributes or both [193], and $q$ is the maximum degree of vertices in the neighborhood
graph. The complexity of updating the singular vectors is $O(nm)$, which is linear in $n$, because the
number of landmark points $m$ is fixed.
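
The three SVD steps for refining the singular vectors can be illustrated as follows. This is a hedged sketch: the function name `landmark_svd_update` and the toy rank-$d$ matrix are hypothetical, and the code assumes that $X$ is stored with one row per point ($n \times d$) so that the product $BX$ in step 1 is well defined.

```python
import numpy as np

def landmark_svd_update(B, X):
    """One pass of the three SVD steps.

    B : (m, n) matrix to be approximated (m landmarks, n points).
    X : (n, d) current co-ordinate estimate, one row per point (assumed layout).
    """
    # Step 1: SVD of B X gives an improved left singular basis U1 (m, d).
    U1, _, _ = np.linalg.svd(B @ X, full_matrices=False)
    # Step 2: SVD of B^T U1 gives the right basis U2 (n, d) and values S2.
    U2, s2, _ = np.linalg.svd(B.T @ U1, full_matrices=False)
    # Step 3: scale both bases by the square roots of the singular values.
    root = np.sqrt(s2)
    return U2 * root, U1 * root, s2

# Toy data (hypothetical): an exactly rank-2 matrix with known factors.
rng = np.random.default_rng(1)
m, n, d = 10, 50, 2
U, _ = np.linalg.qr(rng.standard_normal((m, d)))
V, _ = np.linalg.qr(rng.standard_normal((n, d)))
S_true = np.array([5.0, 3.0])
B = (U * S_true) @ V.T

# Start from a perturbed version of the exact co-ordinates.
X0 = V * np.sqrt(S_true) + 0.1 * rng.standard_normal((n, d))
X_new, Q_new, s2 = landmark_svd_update(B, X0)
```

Each step involves an $m \times n$ product with a thin matrix followed by an SVD of a matrix with only $d$ columns, which is consistent with the $O(nm)$ cost stated above when $d$ is treated as a constant.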


3.2.3  Vertex  Contraction 

Owing to the non-parametric nature of ISOMAP, the data points collected need to be stored in
memory in order to refine the estimation of the geodesic distances $g_{ij}$ and the co-ordinates $x_i$. This
can be undesirable if we have an arbitrarily large data stream.

One simple solution is to discard the oldest data point when a pre-determined number of data
points has been accumulated. This has the additional advantage of making the algorithm adaptive
to drift in data characteristics. The deletion should take place after the completion of all the
updates due to the new point. Deleting the vertex $v_i$ is easy: the edge deletion procedure is used to
delete all the edges incident on $v_i$ for both ISOMAP and landmark ISOMAP.

We can do better than deletion, however. A vertex contraction heuristic can be used to record
the improvement in geodesic distance estimate without storing additional points. Most of the
information the new vertex $v_{n+1}$ contains about the geodesic distance estimate is represented by the
shortest paths passing through $v_{n+1}$. Suppose $sp(a, b)$ can be written as $v_a \to v_{n+1} \to v_b$.
The geodesic distance between $v_a$ and $v_b$ can be preserved by introducing a new edge $e(a, b)$ with
weight $(w_{a,n+1} + w_{n+1,b})$, even though $v_{n+1}$ is deleted. Both the shortest path tree $T(a)$ and
the graph are updated in view of this new edge. This procedure cannot create inconsistency in any
shortest path trees, because the subpath of any shortest path is also a shortest path. This heuristic
increases the density of the edges in the graph, however.
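
A small sketch may make the contraction step concrete. It is a simplified variant written for illustration only: it bridges every pair of neighbors of the contracted vertex whenever the two-hop path is shorter than any existing direct edge, rather than only those pairs whose shortest path actually passes through the vertex. The function names and the toy graph are hypothetical.

```python
import heapq

def dijkstra(adj, source):
    """Single-source shortest path lengths on a dict-of-dicts graph."""
    dist = {v: float("inf") for v in adj}
    dist[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue                      # stale queue entry
        for v, w in adj[u].items():
            if d + w < dist[v]:
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist

def contract_vertex(adj, v):
    """Delete v, adding edges e(a, b) of weight w_av + w_vb between its neighbors."""
    nbrs = adj.pop(v)
    for a in nbrs:
        del adj[a][v]
    items = list(nbrs.items())
    for idx, (a, w_a) in enumerate(items):
        for b, w_b in items[idx + 1:]:
            w = w_a + w_b
            if w < adj[a].get(b, float("inf")):
                adj[a][b] = w
                adj[b][a] = w
    return adj

# Toy graph (hypothetical): the shortest path from 0 to 2 runs through 1.
adj = {0: {1: 1.0, 2: 5.0}, 1: {0: 1.0, 2: 1.0},
       2: {0: 5.0, 1: 1.0, 3: 1.0}, 3: {2: 1.0}}
before = dijkstra(adj, 0)
contract_vertex(adj, 1)
after = dijkstra(adj, 0)
```

After the contraction the distances among the remaining vertices are unchanged, at the price of a denser graph, as noted above.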

Which vertex should be contracted? A simple choice is to contract the new vertex $v_{n+1}$ after
adjusting for the change of geodesic distances. Alternatively, we can delete the vertices that are
most “crowded” so that the points are spread more evenly along the manifold. This can be done by
contracting the non-landmark point whose nearest neighbor is the closest to itself.


3.3  Experiments 

We have implemented our main algorithm in Matlab, with the graph theoretic parts written in C++.
The running time is measured on a Pentium IV 3.2 GHz PC with 512MB memory running Windows
XP, using the profiler of Matlab with the Java virtual machine turned off.


3.3.1  Incremental  ISOMAP:  Basic  Version 


We evaluated the accuracy and the efficiency of our incremental algorithm on several data sets. The
first experiment was on the Swiss roll data set. It is a typical benchmark for manifold learning.
Because of its “roll” nature, geodesic distances are more appropriate in understanding the structure
of this data set than Euclidean distances. Initialization was done by finding the co-ordinate estimate
$x_i$ for 100 randomly selected points using the “batch” ISOMAP, with a knn neighborhood of size
6. Random points from the Swiss roll data set were added one by one, until 1500 points were
accumulated. The incremental algorithm described in section 3.2.1 was used to update the
co-ordinates. The first two dimensions of $x_i$ corresponded to the true structure of the manifold.
The gap between the second and the third eigenvalues is fairly significant and it is not difficult to
determine the intrinsic dimensionality as two for this data set. Figure 3.3 shows several snapshots


of the algorithm (footnote 3). The black dots ($x_i^{(n)}$) and the red circles ($\hat{x}_i^{(n)}$) correspond to the co-ordinates
estimated by the incremental and the batch version of ISOMAP, respectively. The red circles and the
black dots match very well, indicating that the co-ordinates updated by the incremental ISOMAP
closely follow the co-ordinates estimated by the batch version. This closeness can be quantified
by an error measure defined as the square root of the mean square error between $x_i^{(n)}$ and $\hat{x}_i^{(n)}$,
normalized by the total sample variance:

\[
\mathcal{E}_n = \sqrt{ \frac{ \frac{1}{n} \sum_{i=1}^{n} \| x_i^{(n)} - \hat{x}_i^{(n)} \|^2 }{ \frac{1}{n} \sum_{i=1}^{n} \| \hat{x}_i^{(n)} \|^2 } }
\tag{3.7}
\]
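
As an illustration of the error measure described above (root mean square error normalized by the total sample variance), a minimal NumPy sketch is given below. The function name `approximation_error` is hypothetical, and two assumptions are made: the co-ordinates are stored one row per point, and the batch co-ordinates are centered so that their mean squared norm equals the total sample variance. Any sign ambiguity between the two sets of eigenvectors must be resolved before calling it.

```python
import numpy as np

def approximation_error(X_incr, X_batch):
    """Normalized RMS difference between two (n, d) sets of co-ordinates."""
    # Mean squared distance between corresponding points.
    mse = np.mean(np.sum((X_incr - X_batch) ** 2, axis=1))
    # Total sample variance of the (assumed centered) batch co-ordinates.
    total_var = np.mean(np.sum(X_batch ** 2, axis=1))
    return float(np.sqrt(mse / total_var))

# Toy check: incremental co-ordinates that are uniformly 10% too large.
X_batch = np.array([[1.0, 0.0], [-1.0, 0.0]])
X_incr = 1.1 * X_batch
err = approximation_error(X_incr, X_batch)
```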


3 The AVI files can be found at http://www.cse.msu.edu/prip/ResearchProjects/iisomap/.

Table 3.1: Run time (seconds) for batch and incremental ISOMAP. For batch ISOMAP, computation of $x_{n+1}$ and updating of $x_i$ are performed
together. Hence there is only one combined run time.

                      Swiss roll       S-curve          Rendered face    MNIST 2          ethn
                      Batch    Incr.   Batch    Incr.   Batch    Incr.   Batch    Incr.   Batch    Incr.
Neighborhood graph    230.0    1.4     229.9    1.2     20.8     0.4     239.7    (gap)   239.3    1.5
Geodesic distance     1618.5   53.2    1632.8   56.2    157.8    5.8     1683.9   41.0    1752.9   39.6
Computing $x_{n+1}$   804.6*   0.9     760.0*   0.8     85.1*    0.3     671.3*   0.8     645.2*   1.0
Updating $x_i$        *        49.5    *        49.4    *        5.5     *        51.5    *        48.9

* Combined batch run time for computing $x_{n+1}$ and updating $x_i$.
(gap): value missing in the source.

Table 3.2: Run time (seconds) for executing batch and incremental ISOMAP once for different numbers of points (n). “Dist.” corresponds to the time
for distance computation for all the n points.

        Swiss roll              S-curve                 Rendered face           MNIST 2                 ethn
n       Dist.   Batch   Incr.   Dist.   Batch   Incr.   Dist.   Batch   Incr.   Dist.   Batch   Incr.   Dist.   Batch   Incr.
500     0.09    0.66    0.04    0.09    0.64    0.04    0.16    0.60    0.05    0.28    0.62    0.04    1.19    0.61    0.04
1000    0.38    2.62    0.14    0.38    2.47    0.11    N/A     N/A     N/A     1.09    2.34    0.07    4.52    2.45    0.08
1500    0.84    5.72    0.17    0.84    5.65    0.25    N/A     N/A     N/A     2.42    5.41    0.18    10.06   5.65    0.15


Figure 3.3: Snapshots of “Swiss roll” for incremental ISOMAP. In the first column, the circles and dots in the figures represent the co-ordinates
estimated by the batch and the incremental version, respectively. The square and asterisk denote the co-ordinates of the newly added point, estimated
by the batch and the incremental algorithm, respectively. The second column contains scatter plots, where the color of a point corresponds to the value
of the most dominant co-ordinate estimated by ISOMAP. The third column illustrates the neighborhood graphs, from which the geodesic distances
are estimated.

Figure 3.4: Approximation error ($\mathcal{E}_n$) between the co-ordinates estimated by the basic incremental
ISOMAP and the basic batch ISOMAP for different numbers of data points (n) for the five data
sets.

Figure 3.4(a) displays $\mathcal{E}_n$ against the number of data points n for Swiss roll. We can see that the
proposed updating method is fairly accurate, with an average error of 0.05%. The “spikes” in the
graph correspond to the instances where many geodesic distances change dramatically because of
the creation or deletion of “short-cuts” in the neighborhood graph. These large errors fade very
quickly, however, as is evident from the graph.

Table  3.1  shows  the  computation  time  decomposed  into  different  tasks.  Our  incremental  approach 
has  significant  savings  in  all  three  aspects  of  ISOMAP:  graph  update,  geodesic  distance  update,  and 
co-ordinate  update.  The  computation  time  for  the  distances  is  not  included  in  the  table,  because  both 
batch and incremental versions perform the same number of distance computations. Empirically,
we observed that for a moderate number of data points, the time to update the geodesic distances is
longer than the time to update the co-ordinates, whereas the opposite is true when a large number
of points have been collected. This is probably because the geodesic distances change
more rapidly when only a moderate amount of data has been collected, whereas the time for matrix
multiplication becomes more significant with a larger number of co-ordinates. We have also run
the  batch  algorithm  once  for  different  numbers  of  data  points  (n).  Table  3.2  shows  the  measured 
time  averaged  over  5  identical  trials,  after  excluding  the  time  for  distance  computation.  The  time  for 
computing  the  distances  for  all  the  n  points,  together  with  the  time  to  run  the  incremental  algorithm 
once  to  update  when  the  n-th  point  arrives,  is  also  included  in  the  table.  See  Section  3.4  for  further 
discussion  of  the  result. 

The  co-ordinates  estimated  with  different  number  of  data  points  are  also  compared  with  the 
co-ordinates  estimated  with  all  the  available  data  points.  This  can  give  us  an  additional  insight 
on  how  the  estimated  co-ordinates  evolve  to  their  final  values  as  new  data  points  gradually  arrive. 
Some  snapshots  are  shown  in  Figure  3.5. 

A similar experimental procedure was applied to other data sets. The “S-curve” data set, another
benchmark for manifold learning, contains points in a 3D space lying on an “S”-shaped surface, with
an effective dimensionality of two. The “rendered face” data set (footnote 4) contains 698 face images of
size 64 by 64 rendered at different illumination and pose conditions. Some examples are shown in
Figure 3.6. The “MNIST digit 2” data set is derived from the digit images “2” from MNIST (footnote 5), and
contains 28 by 28 digit images. Several typical images are shown in Figure 3.7. The rendered face
data set and the MNIST digit 2 data set were used in the original ISOMAP paper [248]. Our last
data set, ethn, contains the face images used in [175]. The task for this data set is to classify a 64 by
64 face image as Asian or non-Asian. This database contains 1320 images for the Asian class and 1310
images for the non-Asian class, and is composed of several face image databases, including the PF01
database (footnote 6), the Yale database (footnote 7), the AR database [181], as well as the non-public NLPR
database (footnote 8). Some example face images are shown in Figure 3.8. For all these images, the high dimensional
feature vectors were created by concatenating the image pixels. The neighborhood size for MNIST
digit 2 and ethn was set to 10 in order to demonstrate that the proposed approach is efficient and
accurate irrespective of the neighborhood used. The approximation error and the computation time
for these data sets are shown in Figure 3.4 and Table 3.1. We can see that the incremental ISOMAP
is accurate and efficient for updating the co-ordinates for all these data sets.

Since  the  ethn  data  set  is  from  a  supervised  classification  problem  with  two  classes,  we  also 
want  to  investigate  the  quality  of  the  ISOMAP  mapping  with  respect  to  classification.  This  is 


4 http://isomap.stanford.edu

5 http://yann.lecun.com/exdb/mnist/

6 http://nova.postech.ac.kr/archives/imdb.html

7 http://cvc.yale.edu/projects/yalefaces/yalefaces.html

8 Provided by Dr. Yunhong Wang, National Laboratory for Pattern Recognition, Beijing.

[Figure 3.5 panels: (e) n = 1200; (f) Final, n = 1500]


Figure  3.5:  Evolution  of  the  estimated  co-ordinates  for  Swiss  roll  to  their  final  values.  The  black 
dots  denote  the  co-ordinates  estimated  with  different  number  of  samples,  whereas  red  circles  show 
the  co-ordinates  estimated  with  all  the  1500  points.  The  co-ordinates  have  been  re-scaled  to  better 
observe  the  trend. 


Figure 3.6: Example images from the rendered face image data set. This data set can be found at
the ISOMAP web-site.


Figure 3.7: Example “2” digits from the MNIST database. The MNIST database can be found at
http://yann.lecun.com/exdb/mnist/.

Figure 3.8: Example face images from the ethn database.


Figure 3.9: Classification performance on the ethn database for basic ISOMAP.

done quantitatively by computing the leave-one-out nearest neighbor (with respect to $L_2$ distance)
error  rate  using  different  dimensions  of  the  co-ordinates  estimated  by  incremental  ISOMAP  with 
1500  points.  For  comparison,  we  project  the  data  linearly  to  the  best  hyperplane  by  PCA  and  also 
evaluate  the  corresponding  leave-one-out  error  rate.  Figure  3.9  shows  the  result.  The  representation 
recovered  by  ISOMAP  leads  to  a  smaller  error  rate  than  PCA.  Note  that  the  performance  of  PCA 
can  be  improved  by  rescaling  each  feature  so  that  all  of  them  have  equal  variance,  though  the 
rescaling  is  essentially  a  post-processing  step,  not  required  by  ISOMAP. 
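
The leave-one-out nearest neighbor error rate used in this comparison can be computed as in the following sketch. The function name `loo_nn_error` and the toy data are hypothetical, and the code assumes the embedded co-ordinates are stored one row per point; to sweep over the number of dimensions one would pass only the leading columns of the embedding.

```python
import numpy as np

def loo_nn_error(X, labels):
    """Leave-one-out 1-nearest-neighbor error rate under the L2 distance.

    X      : (n, d) array of co-ordinates, one row per point.
    labels : (n,) array of class labels.
    """
    sq = np.sum(X ** 2, axis=1)
    # Squared Euclidean distances between all pairs of points.
    D = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    np.fill_diagonal(D, np.inf)       # a point cannot be its own neighbor
    nearest = np.argmin(D, axis=1)
    return float(np.mean(labels[nearest] != labels))

# Toy data (hypothetical): the last point lies closer to the other class.
X = np.array([[0.0], [0.1], [1.0], [1.1], [0.6]])
labels = np.array([0, 0, 1, 1, 0])
err = loo_nn_error(X, labels)
```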

3.3.2  Experiments  on  Landmark  ISOMAP 

A  similar  experimental  procedure  was  applied  to  the  incremental  landmark  ISOMAP  described  in 
section  3.2.2  for  Swiss  roll,  S-curve,  rendered  face,  MNIST  digit  2,  and  ethn  data  sets.  Starting 
with 200 randomly selected points from the data set, random points were added until a total of
5000 points (footnote 9) had accumulated. Forty points from the initial 200 points were chosen randomly to be
the  landmark  points.  Snapshots  comparing  the  co-ordinates  estimated  by  the  batch  version  and 
the  incremental  version  for  Swiss  roll  are  shown  in  Figure  3.10.  The  approximation  error  and  the 
computation  time  are  shown  in  Figure  3.11  and  Table  3.3,  respectively.  The  time  to  run  the  batch 
version  only  once  is  listed  in  Table  3.4.  Once  again,  the  co-ordinates  estimated  by  the  incremental 
version  are  accurate  with  respect  to  the  batch  version,  and  the  computation  time  is  much  less.  We 
also  consider  the  classification  accuracy  using  landmark  ISOMAP  on  all  the  2630  images  in  the  ethn 
data  set.  The  result  is  shown  in  Figure  3.12.  The  co-ordinates  estimated  by  landmark  ISOMAP 
again  lead  to  a  smaller  error  rate  than  those  based  on  PCA.  The  difference  is  more  pronounced  when 
the  number  of  dimensions  is  small  (less  than  five). 

3.3.3  Vertex  Contraction 

The utility of vertex contraction is illustrated in the following experiment. Consider a manifold of a
3-dimensional unit hemisphere embedded in a 10-dimensional space. The geodesic on this manifold
is simply the great circle, and the geodesic distance between $x_1$ and $x_2$ on the manifold is given
by $\cos^{-1}(x_1^T x_2)$. Data points lying on this manifold are randomly generated. With $K = 6$, 40
landmark points and 1000 points in memory, vertex contraction is executed until 10000 points are
examined. The geodesic distances between the landmark points and the points in memory
are compared with the ground truth, and the discrepancy is shown by the solid line in Figure 3.13. As
more points are encountered, the error decreases, indicating that vertex contraction indeed improves
the geodesic distance estimate. There is, however, a lower limit (around 0.03) on the achievable
accuracy, because of the finite size of samples retained in the memory. When additional points are
kept in the memory instead of being contracted, the improvement of geodesic distance estimate is
significantly slower (the dash-dot line in Figure 3.13). We can see that vertex contraction indeed
improves the geodesic distance estimate, partly because it spreads the data points more evenly, and
partly because more points are included in the neighborhood effectively.
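
The ground-truth geodesic distances for this experiment follow directly from the arccosine formula. The sketch below shows one way to generate such data. The sampler is hypothetical: the text does not specify the sampling distribution or how the hemisphere is placed in the 10-dimensional space, so the code assumes uniform sampling on half of the unit sphere in 4 dimensions (a 3-dimensional manifold) followed by a random orthonormal embedding, which preserves inner products.

```python
import numpy as np

def great_circle_distance(X1, X2):
    """Pairwise geodesic distances between unit vectors stored as rows."""
    # Clip to guard against round-off pushing cosines outside [-1, 1].
    cosines = np.clip(X1 @ X2.T, -1.0, 1.0)
    return np.arccos(cosines)

def sample_hemisphere(n, rng, sphere_dim=4, ambient_dim=10):
    """Hypothetical sampler: unit hemisphere embedded in a higher dimension."""
    Z = rng.standard_normal((n, sphere_dim))
    Z /= np.linalg.norm(Z, axis=1, keepdims=True)   # project onto the sphere
    Z[:, -1] = np.abs(Z[:, -1])                     # keep one half only
    # Orthonormal columns: the embedding preserves inner products.
    A, _ = np.linalg.qr(rng.standard_normal((ambient_dim, sphere_dim)))
    return Z, Z @ A.T

rng = np.random.default_rng(0)
Z, Y = sample_hemisphere(1000, rng)
# Ground truth between 40 landmark points and all points in memory.
G_true = great_circle_distance(Y[:40], Y)
```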

3.3.4 Incorporating Invariance By Incremental Learning

One interesting use of incremental learning is to incorporate invariance by “hallucinating” training
data. Given a training sample $y_i$, additional training data $y_i^{(1)}, y_i^{(2)}, \ldots$ can be created by applying
different invariance transformations on $y_i$. The amount of training data can be unbounded, because


9 When the data set had fewer than 5000 points, the experiment stopped after all the points had been used.

Table 3.3: Run time (seconds) for batch and incremental landmark ISOMAP. For batch ISOMAP, computation of $x_{n+1}$ and updating of $x_i$ are
performed together. Hence there is only one combined run time.

Table 3.4: Run time (seconds) for executing batch and incremental landmark ISOMAP once for different numbers of points (n). “Dist.” corresponds
to the time for distance computation for all the n points.

Figure 3.10: Snapshots of “Swiss roll” for incremental landmark ISOMAP. In the first column, the circles and dots in the figures represent the
co-ordinates estimated by the batch and the incremental version, respectively. The square and asterisk denote the co-ordinates of the newly added
point, estimated by the batch and the incremental algorithm, respectively. The second column contains scatter plots, where the color of a point
corresponds to the value of the most dominant co-ordinate estimated by ISOMAP. The third column illustrates the neighborhood graphs, from which
the geodesic distances are estimated. It is similar to Figure 3.3, except that the landmark version of ISOMAP is used instead.

Figure 3.11: Approximation error ($\mathcal{E}_n$) between the co-ordinates estimated by the incremental landmark
ISOMAP and the batch landmark ISOMAP for different numbers of data points (n). It
is similar to Figure 3.4, except that incremental landmark ISOMAP is used instead of the basic
incremental ISOMAP.

Figure 3.12: Classification performance on the ethn database, landmark ISOMAP.


the  number  of  possible  invariance  transformations  is  infinite.  This  unboundedness  calls  for  an 
incremental  algorithm,  which  can  accumulate  the  effect  of  the  data  generated.  This  idea  has  been 
exploited  in  [235]  for  improving  the  accuracy  in  digit  classification.  Given  a  digit  image,  simple 
distortions  like  translation,  rotation,  and  skewing  are  applied  to  create  additional  training  data  for 
improving  the  invariance  property  of  a  neural  network. 

We tested a similar idea using the proposed incremental ISOMAP. The training data were generated
by first randomly selecting an image from 500 digit “2” images in the MNIST training set.
The image was then rotated randomly by $\theta$ degrees, where $\theta$ was uniformly distributed in $[-30, 30]$.
The rotated image was used as the input for the incremental landmark ISOMAP with 40 landmarks and
a memory size of 10000, with vertex contraction enabled. The training was stopped when 60000
training images were generated. We wanted to investigate how well the rotation angle is recovered
by the nonlinear mapping. This was done by using an independent set of digit “2” images from
the MNIST testing set, which was of size 1032. Each image $y^{(i)}$ was rotated by 15 different
angles: $30j/7$ for $j = -7, \ldots, 7$. The mappings of these 15 images, $x_{-7}^{(i)}, \ldots, x_7^{(i)}$, were found using
the out-of-sample extension of ISOMAP. If ISOMAP can discover the rotation angle, there should
exist a linear projection direction $h$ such that $h^T x_j^{(i)} \approx j + c_i$ for all $i$ and $j$, where $c_i$ is a constant
specific to $y^{(i)}$. This is equivalent to


(3.8) 


How well the rotation angle is recovered can thus be quantified by the residue of the above equation.
For comparison, a similar procedure was applied for PCA using the first 10000 generated images.
Figure 3.14 shows the result. We can see that the residue for ISOMAP is smaller than that for PCA,
indicating that ISOMAP recovers the rotation angle better. The residue is even smaller when
additional images are generated to improve the mapping.

Figure 3.13: Utility of vertex contraction. Solid line: the root-mean-square error (when compared
with  the  ground  truth)  of  the  geodesic  distance  estimate  for  points  currently  held  in  memory  when 
vertex  contraction  is  used.  Dash-dot  line:  the  corresponding  root-mean-square  error  when  the  new 
points  are  stored  in  the  memory  instead  of  being  contracted. 

3.4  Discussion 

We  have  presented  algorithms  to  incrementally  update  the  co-ordinates  produced  by  ISOMAP. 
Our  approach  can  be  extended  to  other  manifold  learning  algorithms;  for  example,  creating  an 
incremental  version  of  Laplacian  eigenmap  requires  the  update  of  the  neighborhood  graph  and  the 
leading  eigenvectors  of  a  matrix  (graph  Laplacian)  derived  from  the  neighborhood  graph. 

The  convergence  of  geodesic  distance  is  guaranteed  since  the  geodesic  distances  are  maintained 
exactly.  Subspace  iteration  used  in  co-ordinate  update  is  provably  convergent  if  a  sufficient  number 
of  iterations  is  used,  assuming  all  eigenvalues  are  simple,  which  is  generally  the  case.  The  fact  that 
we  only  run  subspace  iteration  once  can  be  interpreted  as  trading  off  guaranteed  convergence  with 
empirical  efficiency.  Since  the  change  in  target  inner  product  matrix  is  often  small,  the  eigenvector 
improvement  due  to  subspace  iterations  with  different  number  of  points  is  aggregated,  leading  to 
the  low  approximation  error  as  shown  in  Figures  3.4  and  3.11. 

While  running  the  proposed  incremental  ISOMAP  is  much  faster  than  running  the  batch  version 
repeatedly,  it  is  more  efficient  to  run  the  batch  version  once  using  all  the  data  points  if  only  the 
final  solution  is  desired  (compare  Tables  3.1  and  3.2,  as  well  as  Tables  3.3  and  3.4).  It  is  because 
maintaining  intermediate  geodesic  distances  and  co-ordinates  accurately  requires  extra  computation. 
The  incremental  algorithm  can  be  made  faster  if  the  geodesic  distances  are  updated  upon  seeing  p 
subsequent points, $p > 1$. We first embed $y_{n+1}, \ldots, y_{n+p}$ independently by the method in section
3.2.1.3. The geodesic distances among the existing points are not updated, and the same set of
$x_i$ is used to find $x_{n+1}, \ldots, x_{n+p}$. After that, all the geodesic distances are updated, followed by
the update of $x_1, \ldots, x_{n+p}$ by subspace iteration. This strategy makes the incremental algorithm
almost $p$ times faster, because the time to embed the new points is very small (see the time for
“computing $x_{n+1}$” in Tables 3.1 and 3.3). On the other hand, the quality of the embedding will
deteriorate because the embedding of the existing points cannot benefit from the new points. This

Figure 3.14: Sum of residue square for 1032 images in 15 rotation angles. The larger the residue, the
worse the representation. “PCA” and “ISOMAP” correspond to the mappings obtained
by PCA and ISOMAP when 10000 generated images are used for training, respectively. “ISOMAP
II”/“PCA II” and “ISOMAP III”/“PCA III” correspond to the result when the learning stops after
20000 and 50000 images are generated, respectively.


strategy  is  particularly  attractive  with  large  n,  because  the  effect  of  yn_|_i,  •  •  ■  ,yn+p  on  yn^p-|_i 
is  small. 

Also, for a fixed amount of memory, the solution obtained by the incremental version can be
superior to that of the batch version. This is because the incremental version can perform vertex
contraction, thereby obtaining a better geodesic distance estimate. The incremental version can
also be easily adapted to an unbounded data stream, for instance when training data are generated
by applying invariance transformations.


3.4.1  Variants  of  the  Main  Algorithms 

Our  incremental  algorithm  can  be  modified  to  cope  with  variable  neighborhood  definition,  if  the 
user is willing to do some tedious book-keeping. We can, for example, use an $\epsilon$-neighborhood with
the value of $\epsilon$ re-adjusted whenever, say, 200 data points have arrived. This can be easily achieved
by  first  calculating  the  edges  that  need  to  be  deleted  or  added  because  of  the  new  neighborhood 
definition.  The  algorithms  in  sections  3.2.1  and  3.2.2  are  then  used  to  update  the  geodesic  distances. 
The  embedded  co-ordinates  can  then  be  updated  accordingly. 

The  supervised  ISOMAP  algorithm  in  [276],  which  utilizes  a  criterion  similar  to  the  Fisher 
discriminant  for  embedding,  can  also  be  converted  to  become  incremental.  The  only  change  is 
that  the  subspace  iteration  method  for  solving  a  generalized  eigenvalue  problem  is  used  instead. 
The  proposed  incremental  ISOMAP  can  be  easily  converted  to  incremental  conformal  ISOMAP 
[55]. In conformal ISOMAP, the edge weight $w_{ij}$ is $\Delta_{ij} / \sqrt{M(i) M(j)}$, where $M(i)$ denotes the
mean distance of $y_i$ from its $k$ nearest neighbors. The computation of the shortest path distances and
eigen-decomposition  remains  the  same.  To  convert  this  to  its  incremental  counterpart,  we  need  to 
maintain  the  sum  of  the  weights  of  the  k  nearest  neighbors  of  different  vertices.  The  change  in 
the  edge  weights  due  to  the  insertion  and  deletion  of  edges  as  a  new  point  comes  can  be  easily 
tracked.  The  target  inner  product  matrix  is  updated,  and  subspace  iteration  can  be  used  to  update 
the  embedding. 
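
The conformal edge weights described above can be illustrated with a small batch computation. This is only a sketch under stated assumptions: the function name `conformal_edge_weights` is hypothetical, $M(i)$ is taken to be the mean distance to the $k$ nearest neighbors, and the neighborhood graph is built from scratch rather than maintained incrementally.

```python
import numpy as np

def conformal_edge_weights(Y, k):
    """Batch sketch of conformal ISOMAP edge weights on a k-NN graph.

    Y is an (n, D) array of input points. Returns (W, M), where W[i, j] is
    Delta_ij / sqrt(M(i) M(j)) on k-NN edges and np.inf where there is no edge.
    """
    n = Y.shape[0]
    # Pairwise Euclidean distances Delta_ij between the input points.
    diff = Y[:, None, :] - Y[None, :, :]
    delta = np.sqrt((diff ** 2).sum(axis=-1))
    # Indices of the k nearest neighbors of each point (column 0 is the point itself,
    # assuming all points are distinct).
    nbrs = np.argsort(delta, axis=1)[:, 1:k + 1]
    # M(i): mean distance from y_i to its k nearest neighbors.
    M = np.take_along_axis(delta, nbrs, axis=1).mean(axis=1)
    # Fill in the rescaled weights on the k-NN edges only.
    W = np.full((n, n), np.inf)
    rows = np.repeat(np.arange(n), k)
    cols = nbrs.ravel()
    W[rows, cols] = delta[rows, cols] / np.sqrt(M[rows] * M[cols])
    # An edge exists if either endpoint lists the other as a neighbor.
    W = np.minimum(W, W.T)
    return W, M
```

In an incremental setting one would instead keep the per-vertex neighbor distance sums and update only the entries touched by inserted or deleted edges; the batch version above is meant only to make the weight definition concrete.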

3.4.2 Comparison With Out-of-sample Extension

One  problem  closely  related  to  incremental  nonlinear  mapping  is  the  “out-of-sample  extension” 
[17]: given the embedding $x_1, \ldots, x_n$ for a “training set” $y_1, \ldots, y_n$, what is the embedding result
($x_{n+1}$) for a “testing” point $y_{n+1}$? This is effectively the problem considered in section 3.2.1.3. In
incremental learning, however, we go beyond obtaining $x_{n+1}$: the co-ordinate estimates $x_1, \ldots, x_n$
of the existing points are also improved by $y_{n+1}$. In the case of incremental ISOMAP, this amounts
to  updating  the  geodesic  distances  and  then  applying  subspace  iteration. 

The  out-of-sample  extension  is  faster  because  it  skips  the  improvement  step.  However,  it  is  less 
accurate,  and  cannot  provide  intermediate  embedding  with  good  quality  as  points  are  accumulated. 
Incremental  ISOMAP,  on  the  other  hand,  utilizes  the  new  samples  to  continuously  improve  the 
co-ordinate  estimates.  Out-of-sample  extension  may  be  more  appealing  when  a  large  number  of 
samples have been accumulated and the geodesic distances and $x_1, \ldots, x_n$ are reasonably accurate.
Even in this case, though, the strategy of updating $x_1, \ldots, x_n$ after $p$ new points (with $p > 1$) have
been embedded works equally well. The updating of geodesic distances and co-ordinates occurs
infrequently  in  this  case,  and  its  amortized  computational  cost  is  very  low. 

Incremental  ISOMAP  is  also  preferable  to  out-of-sample  extension  when  there  is  a  drifting  of  data 
characteristics.  In  out-of-sample  extension,  the  n  points  collected  are  assumed  to  be  representative 
of  all  future  data  points  that  are  likely  to  be  observed.  There  is  no  way  to  capture  the  change  of 
data  characteristics.  In  incremental  ISOMAP,  however,  we  can  easily  maintain  an  embedding  using 
a  window  of  the  points  recently  encountered.  Changes  in  data  characteristics  are  captured  as  the 
geodesic  distances  and  co-ordinate  estimates  are  updated.  Vertex  contraction  should  be  turned  off 
if  incremental  ISOMAP  is  run  in  this  mode,  to  ensure  that  the  effect  of  old  data  points  is  erased. 

3.4.3  Implementation  Details 

The subspace iteration in section 3.2.1.4 requires that the eigenvalues corresponding to the leading
eigenvectors have the largest absolute values. This can be violated if the target inner product matrix
has a large negative eigenvalue. To tackle this, we shift the spectrum and find the eigenvectors of
$(B + \alpha I)$ instead of $B$. Subspace iteration on $(B + \alpha I)$ can proceed in almost the same manner, because
$(B + \alpha I)v = Bv + \alpha v$. While a large value of $\alpha$ guarantees that all shifted eigenvalues are positive, this
has the adverse effect of reducing the rate of convergence of the eigenvectors, because the shift brings
the ratio between adjacent eigenvalues closer to one. We empirically set
$\alpha = \max(-0.7\,\lambda_{\min}(B) - 0.3\,\lambda_{d\text{-th}}(B),\, 0)$,
where $\lambda_{\min}(B)$ and $\lambda_{d\text{-th}}(B)$ denote the smallest (most negative) and the $d$-th largest eigenvalues,
respectively. The latter is maintained by the incremental algorithm, while the former can be
found by, say, residual norm bounds or Gershgorin disk bounds. In practice, $\lambda_{\min}(B)$ is found
at  the  initialization  stage.  This  estimate  is  updated  only  when  a  large  number  of  data  points  have 
been  accumulated. 
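
The spectrum shift can be illustrated with a small NumPy sketch. It is not the implementation used in the experiments: the function names `choose_shift` and `shifted_subspace_iteration` are hypothetical, the Rayleigh-Ritz step is one common way to organize subspace iteration, and $\lambda_{\min}(B)$ is assumed to be supplied by the caller.

```python
import numpy as np

def choose_shift(lam_min, lam_d):
    """Empirical shift: alpha = max(-0.7 * lam_min - 0.3 * lam_d, 0)."""
    return max(-0.7 * lam_min - 0.3 * lam_d, 0.0)

def shifted_subspace_iteration(B, V, alpha, n_iter=1):
    """Refine d leading eigenvectors of symmetric B, starting from the n x d block V.

    The iteration is applied to (B + alpha I) without forming the shifted matrix.
    Returns the refined eigenvectors and the corresponding eigenvalues of B.
    """
    lam = None
    for _ in range(n_iter):
        # (B + alpha I) V = B V + alpha V, so the shift costs one extra scaled addition.
        Z = B @ V + alpha * V
        # Re-orthonormalize the iterated block.
        Q, _ = np.linalg.qr(Z)
        # Rayleigh-Ritz step on the shifted matrix restricted to span(Q).
        H = Q.T @ (B @ Q) + alpha * np.eye(Q.shape[1])
        evals, S = np.linalg.eigh(H)
        order = np.argsort(evals)[::-1]
        V = Q @ S[:, order]
        lam = evals[order] - alpha  # undo the shift to recover eigenvalues of B
    return V, lam
```

With `n_iter=1` and `V` set to the previous eigenvector estimate, this mirrors the single refinement pass per update described earlier; larger `n_iter` trades time for accuracy.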

During  the  incremental  learning,  the  neighborhood  graph  may  be  temporarily  disconnected.  A 
simple  solution  is  to  embed  only  the  largest  graph  component.  The  excluded  vertices  are  added 
back  for  embedding  again  when  they  become  reconnected  as  additional  data  points  are  encountered. 
Alternatively,  an  edge  can  be  added  between  the  two  nearest  vertices  to  connect  the  two  disconnected 
components  in  the  neighborhood  graph. 


3.5  Summary 

Nonlinear dimensionality reduction is an important problem with applications in pattern recognition,
computer vision, and machine learning. We have developed an algorithm for the incremental
nonlinear mapping problem by modifying the well-known ISOMAP algorithm. The core idea is to
efficiently update the geodesic distances (a graph theoretic problem) and re-estimate the eigenvectors
(a numerical analysis problem), using the previous computation results. Our experiments on
synthetic data as well as real world images validate that the proposed method is almost as accurate
as running the batch version, while saving significant computation time. Our algorithm can also be
easily adapted to other manifold learning methods to produce their incremental versions.


Chapter 4


Simultaneous  Feature  Selection 
and  Clustering 


Hundreds  of  clustering  algorithms  have  been  proposed  in  the  literature  for  clustering  in  different 
applications.  In  this  chapter,  we  examine  a  different  aspect  of  clustering  that  is  often  neglected: 
the  issue  of  feature  selection.  Our  focus  will  be  on  partitional  clustering  by  a  mixture  of  Gaussians, 
though  the  method  presented  here  can  be  easily  generalized  to  other  types  of  mixtures.  We  are 
interested  in  mixture-based  clustering  because  its  statistical  nature  gives  us  a  solid  foundation  for 
analyzing  its  behavior.  Also,  it  leads  to  good  results  in  many  cases.  We  propose  the  concept  of  feature 
saliency  and  introduce  an  expectation-maximization  (EM)  algorithm  to  estimate  it,  in  the  context  of 
mixture-based  clustering.  We  adopt  the  minimum  message  length  (MML)  model  selection  criterion, 
so  the  saliency  of  irrelevant  features  is  driven  towards  zero,  which  corresponds  to  performing  feature 
selection.  The  MML  criterion  and  the  EM  algorithm  are  then  extended  to  simultaneously  estimate 
the  feature  saliencies  and  the  number  of  clusters. 

The  remainder  of  this  chapter  is  organized  as  follows.  We  discuss  the  challenge  of  feature  selection 
in  unsupervised  domain  in  Section  4.1.  In  Section  4.2,  we  review  previous  attempts  to  solve  the 
feature  selection  problem  in  unsupervised  learning.  The  details  of  our  approach  are  presented  in 
Section  4.3.  Experimental  results  are  reported  in  Section  4.4,  followed  by  comments  on  the  proposed 
algorithm  in  Section  4.5.  Finally,  we  conclude  in  Section  4.6. 


4.1  Clustering  and  Feature  Selection 

Clustering, similar to supervised classification and regression, can benefit from using a good
subset of the available features. One simple example illustrating the corrupting influence of irrelevant
features can be seen in Figure 4.1, where the irrelevant feature makes it hard for the algorithm in [81]
to discover the two underlying clusters. Feature selection has been widely studied in the context of
supervised learning (see [101, 26, 122, 151, 153] and references therein, and also section 1.2.3.1), where
the ultimate goal is to select features that can achieve the highest accuracy on unseen data. Feature
selection has received comparatively little attention in unsupervised learning or clustering. One
important reason is that it is not at all clear how to assess the relevance of a subset of features without
resorting to class labels. The problem is made even more challenging when the number of clusters is
unknown, since the optimal number of clusters and the optimal feature subset are inter-related, as
illustrated in Figure 4.2 (taken from [69]). Note that methods based on variance (such as principal
components analysis) need not select good features for clustering, as features with large variance can

Figure 4.1: An irrelevant feature ($x_2$) makes it difficult for the Gaussian mixture learning algorithm
in [81] to recover the two underlying clusters. Gaussian mixture fitting finds seven clusters when
both the features are used, but identifies only two clusters when only the feature $x_1$ is used. The curves
along the horizontal and vertical axes of the figure indicate the marginal distributions of $x_1$ and $x_2$,
respectively.


be  independent  of  the  intrinsic  grouping  of  the  data  (see  Figure  4.3).  Another  important  problem 
in  clustering  is  the  determination  of  the  number  of  clusters,  which  clearly  impacts  and  is  influenced 
by  the  feature  selection  issue.  Most  feature  selection  algorithms  (such  as  [36,  151,  209])  involve  a 
combinatorial  search  through  the  space  of  all  feature  subsets.  Usually,  heuristic  (non-exhaustive) 
methods  have  to  be  adopted,  because  the  size  of  this  space  is  exponential  in  the  number  of  features. 
In  this  case,  one  generally  loses  any  guarantee  of  optimality  of  the  selected  feature  subset. 

We  propose  a  solution  to  the  feature  selection  problem  in  unsupervised  learning  by  casting  it 
as  an  estimation  problem,  thus  avoiding  any  combinatorial  search.  Instead  of  selecting  a  subset  of 
features,  we  estimate  a  set  of  real-valued  (actually  in  [0,  1])  quantities  (one  for  each  feature),  which 
we  call  the  feature  saliencies.  This  estimation  is  carried  out  by  an  EM  algorithm  derived  for  the 
task.  Since  we  are  in  the  presence  of  a  model-selection-type  problem,  it  is  necessary  to  avoid  the 
situation  where  all  the  features  are  completely  salient.  This  is  achieved  by  adopting  a  minimum 
message  length  (MML,  [264,  265])  penalty,  as  was  done  in  [81]  to  select  the  number  of  clusters.  The 
MML  criterion  encourages  the  saliencies  of  the  irrelevant  features  to  go  to  zero,  allowing  us  to  prune 
the  feature  set.  Finally,  we  integrate  the  process  of  feature  saliency  estimation  into  the  mixture 
fitting  algorithm  proposed  in  [81],  thus  obtaining  a  method  that  is  able  to  simultaneously  perform 
feature  selection  and  determine  the  number  of  clusters. 

This  chapter  is  based  on  our  journal  publication  in  [163]. 


4.2  Related  Work 

Most of the literature on feature selection pertains to supervised learning (see Section 1.2.3.1).
Comparatively, not much work has been done for feature selection in unsupervised learning. Of
course, any method conceived for supervised learning that does not use the class labels could be

Figure 4.2: The number of clusters is inter-related with the feature subset used. The optimal feature
subsets for identifying 3, 2, and 1 clusters in this data set are $\{x_1, x_2\}$, $\{x_1\}$, and $\{x_2\}$, respectively.
On the other hand, the optimal numbers of clusters for the feature subsets $\{x_1, x_2\}$, $\{x_1\}$, and $\{x_2\}$ are
also 3, 2, and 1, respectively.


used  for  unsupervised  learning;  this  is  the  case  for  methods  that  measure  feature  similarity  to  detect 
redundant  features,  using,  e.g.,  mutual  information  [221]  or  a  maximum  information  compression 
index  [188].  In  [70,  71],  the  normalized  log-likelihood  and  cluster  separability  are  used  to  evaluate 
the  quality  of  clusters  obtained  with  different  feature  subsets.  Different  feature  subsets  and  different 
numbers  of  clusters,  for  multinomial  model-based  clustering,  are  evaluated  using  marginal  likelihood 
and cross-validated likelihood in [254]. The algorithm described in [218] uses a LASSO-based idea
to select the appropriate features. In [51], the clustering tendency of each feature is assessed by
an entropy index. A genetic algorithm is used in [146] for feature selection in $k$-means clustering.
In  [246],  feature  selection  for  symbolic  data  is  addressed  by  assuming  that  irrelevant  features  are 
uncorrelated  with  the  relevant  features.  Reference  [60]  describes  the  notion  of  “category  utility”  for 
feature  selection  in  a  conceptual  clustering  task.  The  CLIQUE  algorithm  [2]  is  popular  in  the  data 
mining  community,  and  it  finds  hyper-rectangular  shaped  clusters  using  a  subset  of  attributes  for  a 
large  database.  The  wrapper  approach  can  also  be  adopted  to  select  features  for  clustering;  this  has 
been  explored  in  our  earlier  work  [82,  165]. 

All  the  methods  referred  to  above  perform  “hard”  feature  selection  (a  feature  is  either  selected  or 
not).  There  are  also  algorithms  that  assign  weights  to  different  features  to  indicate  their  significance. 
In [190], weights are assigned to different groups of features for $k$-means clustering based on a score
related to the Fisher discriminant. Feature weighting for $k$-means clustering is also considered in
[187], but the goal there is to find the best description of the clusters, after they are identified.
The method described in [204] can be classified as learning feature weights for conditional Gaussian
networks. An EM algorithm based on Bayesian shrinking is proposed in [100] for unsupervised
learning.

Figure 4.3: Deficiency of variance-based method for feature selection. Feature $x_1$, although it
explains more data variance than feature $x_2$, is spurious for the identification of the two clusters in
this data set.


4.3  EM  Algorithm  for  Feature  Saliency 

In this section, we propose an EM algorithm for performing mixture-based (or model-based) clustering
with feature selection. In mixture-based clustering, each data point is modelled as having
been  generated  by  one  of  a  set  of  probabilistic  models  [125,  183].  Clustering  is  then  done  by  learning 
the  parameters  of  these  models  and  the  associated  probabilities.  Each  pattern  is  assigned  to  the 
mixture  component  that  most  likely  generated  it.  Although  the  derivations  below  refer  to  Gaussian 
mixtures,  they  can  be  generalized  to  other  types  of  mixtures. 


4.3.1  Mixture  Densities 

A finite mixture density with $k$ components is defined by

\[
p(y) = \sum_{j=1}^{k} \alpha_j \, p(y \mid \theta_j), \tag{4.1}
\]

where $\forall j,\ \alpha_j \geq 0$; $\sum_j \alpha_j = 1$; each $\theta_j$ is the set of parameters of the $j$-th component (all components
are assumed to have the same form, e.g., Gaussian); and $\theta = \{\theta_1, \ldots, \theta_k, \alpha_1, \ldots, \alpha_k\}$ will denote
the full parameter set. The goal of mixture estimation is to infer $\theta$ from a set of $n$ data points
$\mathcal{Y} = \{y_1, \ldots, y_n\}$, assumed to be samples of a distribution with density given by (4.1). Each $y_i$ is a
$d$-dimensional feature vector $[y_{i1}, \ldots, y_{id}]^T$. In the sequel, we will use the indices $i$, $j$ and $l$ to run
through data points (1 to $n$), mixture components (1 to $k$), and features (1 to $d$), respectively.

As is well known, neither the maximum likelihood (ML) estimate,

\[
\hat{\theta}_{\text{ML}} = \arg\max_{\theta} \{ \log p(\mathcal{Y} \mid \theta) \},
\]

nor the maximum a posteriori (MAP) estimate (given some prior $p(\theta)$),

\[
\hat{\theta}_{\text{MAP}} = \arg\max_{\theta} \{ \log p(\mathcal{Y} \mid \theta) + \log p(\theta) \},
\]

can be found analytically. The usual choice is the EM algorithm, which finds local maxima of these
criteria [183]. This algorithm is based on a set $Z = \{z_1, \ldots, z_n\}$ of $n$ missing (latent) labels, where
$z_i = [z_{i1}, \ldots, z_{ik}]$, with $z_{ij} = 1$ and $z_{ip} = 0$, for $p \neq j$, meaning that $y_i$ is a sample of $p(\cdot \mid \theta_j)$. For
brevity of notation, sometimes we write $z_i = j$ for such $z_i$. The complete data log-likelihood, i.e.,
the log-likelihood if $Z$ were observed, is

\[
\log p(\mathcal{Y}, Z \mid \theta) = \sum_{i=1}^{n} \sum_{j=1}^{k} z_{ij} \log \big[ \alpha_j \, p(y_i \mid \theta_j) \big]. \tag{4.2}
\]

The EM algorithm produces a sequence of estimates $\{\hat{\theta}(t),\ t = 0, 1, 2, \ldots\}$ using two alternating
steps:

• E-step: Compute $W = E[Z \mid \mathcal{Y}, \hat{\theta}(t)]$, the expected value of the missing data given the current
parameter estimate, and plug it into $\log p(\mathcal{Y}, Z \mid \theta)$, yielding the so-called Q-function
$Q(\theta, \hat{\theta}(t)) = \log p(\mathcal{Y}, W \mid \theta)$. Since the elements of $Z$ are binary, we have

\[
w_{ij} = E\big[z_{ij} \mid \mathcal{Y}, \hat{\theta}(t)\big] = \Pr\big[z_{ij} = 1 \mid y_i, \hat{\theta}(t)\big]
= \frac{\hat{\alpha}_j(t) \, p(y_i \mid \hat{\theta}_j(t))}{\sum_{j'=1}^{k} \hat{\alpha}_{j'}(t) \, p(y_i \mid \hat{\theta}_{j'}(t))}. \tag{4.3}
\]

Notice that $\alpha_j$ is the a priori probability that $z_{ij} = 1$ (i.e., that $y_i$ belongs to cluster $j$), while $w_{ij}$
is the corresponding a posteriori probability, after observing $y_i$.

• M-step: Update the parameter estimates,

\[
\hat{\theta}(t + 1) = \arg\max_{\theta} \{ Q(\theta, \hat{\theta}(t)) + \log p(\theta) \},
\]

in the case of MAP estimation, or without $\log p(\theta)$ in the ML case.
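
The two steps above can be sketched for the simplest case of a one-dimensional Gaussian mixture with ML estimation. This is a minimal illustration, not the algorithm of [81]: the function names `gaussian_pdf`, `e_step` and `m_step` are hypothetical, and the M-step uses the standard closed-form ML updates for Gaussian components.

```python
import numpy as np

def gaussian_pdf(y, mean, var):
    """Univariate Gaussian density, evaluated elementwise with broadcasting."""
    return np.exp(-0.5 * (y - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def e_step(y, alpha, mean, var):
    """Posterior probabilities w[i, j] of component j given point y[i], as in (4.3)."""
    # lik[i, j] = alpha_j * p(y_i | theta_j)
    lik = alpha[None, :] * gaussian_pdf(y[:, None], mean[None, :], var[None, :])
    return lik / lik.sum(axis=1, keepdims=True)

def m_step(y, w):
    """Closed-form ML update of mixing weights, means and variances."""
    nj = w.sum(axis=0)  # effective number of points per component
    alpha = nj / len(y)
    mean = (w * y[:, None]).sum(axis=0) / nj
    var = (w * (y[:, None] - mean[None, :]) ** 2).sum(axis=0) / nj
    return alpha, mean, var
```

Alternating `e_step` and `m_step` from an initial guess gives the sequence of estimates; in practice a variance floor is usually needed to avoid degenerate components.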

4.3.2  Feature  Saliency 

In  this  section  we  define  the  concept  of  feature  saliency  and  derive  an  EM  algorithm  to  estimate 
its  value.  We  assume  that  the  features  are  conditionally  independent  given  the  (hidden)  component 
label,  that  is, 

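\[
p(y \mid \theta) = \sum_{j=1}^{k} \alpha_j \, p(y \mid \theta_j) = \sum_{j=1}^{k} \alpha_j \prod_{l=1}^{d} p(y_l \mid \theta_{jl}), \tag{4.4}
\]

where $p(\cdot \mid \theta_{jl})$ is the pdf of the $l$-th feature in the $j$-th component. This assumption enables us to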
utilize  the  power  of  the  EM  algorithm.  In  the  particular  case  of  Gaussian  mixtures,  the  conditional 
independence  assumption  is  equivalent  to  adopting  diagonal  covariance  matrices,  which  is  a  common 
choice  for  high-dimensional  data,  such  as  in  naive  Bayes  classifiers  and  latent  class  models,  as  well 
as  in  the  emission  densities  of  continuous  hidden  Markov  models. 

Among different definitions of feature irrelevancy (proposed for supervised learning), we adopt
the one suggested in [210, 254], which is suitable for unsupervised learning: the $l$-th feature is
irrelevant if its distribution is independent of the class labels, i.e., if it follows a common density,
denoted by $q(y_l \mid \lambda_l)$. Let $\Phi = (\phi_1, \ldots, \phi_d)$ be a set of $d$ binary parameters, such that $\phi_l = 1$ if feature
$l$ is relevant and $\phi_l = 0$ otherwise. The mixture density in (4.4) can then be re-written as

\[
p(y \mid \Phi, \{\alpha_j\}, \{\theta_{jl}\}, \{\lambda_l\}) = \sum_{j=1}^{k} \alpha_j \prod_{l=1}^{d} [p(y_l \mid \theta_{jl})]^{\phi_l} [q(y_l \mid \lambda_l)]^{1 - \phi_l}. \tag{4.5}
\]

A related model for feature selection in supervised learning has been considered in [197, 210]. Intuitively,
$\Phi$ determines which edges exist between the hidden label $z$ and the individual features $y_l$ in
the graphical model illustrated in Figure 4.4, for the case $d = 4$.

Figure 4.4: An example graphical model for the probability model in Equation (4.5) for the case of
four features ($d = 4$) with different indicator variables: (a) $\phi_1 = 1, \phi_2 = 1, \phi_3 = 0, \phi_4 = 1$;
(b) $\phi_1 = 0, \phi_2 = 1, \phi_3 = 1, \phi_4 = 0$. $\phi_l = 1$ corresponds to the existence of an
arc from $z$ to $y_l$, and $\phi_l = 0$ corresponds to its absence.


Our notion of feature saliency is summarized in the following steps: (i) we treat the $\phi_l$'s as
missing variables; (ii) we define the feature saliency as $\rho_l = p(\phi_l = 1)$, the probability that the $l$-th
feature is relevant. This definition makes sense, as it is difficult to know for sure that a certain
feature is irrelevant in unsupervised learning. The resulting model (likelihood function) is written
as

\[
p(y \mid \theta) = \sum_{j=1}^{k} \alpha_j \prod_{l=1}^{d} \big( \rho_l \, p(y_l \mid \theta_{jl}) + (1 - \rho_l) \, q(y_l \mid \lambda_l) \big), \tag{4.6}
\]

where $\theta = \{\{\alpha_j\}, \{\theta_{jl}\}, \{\lambda_l\}, \{\rho_l\}\}$ is the set of all the parameters of the model.


Equation (4.6) can be derived as follows. We treat $\rho_l = p(\phi_l = 1)$ as a set of parameters
to be estimated (the feature saliencies). We assume the $\phi_l$'s are mutually independent and also
independent of the hidden component label $z$ for any pattern $y$. Thus,

\[
\begin{aligned}
p(y, \Phi) &= p(y \mid \Phi) \, p(\Phi) \\
&= \Big( \sum_{j=1}^{k} \alpha_j \prod_{l=1}^{d} [p(y_l \mid \theta_{jl})]^{\phi_l} [q(y_l \mid \lambda_l)]^{1 - \phi_l} \Big) \prod_{l=1}^{d} \rho_l^{\phi_l} (1 - \rho_l)^{1 - \phi_l} \\
&= \sum_{j=1}^{k} \alpha_j \prod_{l=1}^{d} [\rho_l \, p(y_l \mid \theta_{jl})]^{\phi_l} [(1 - \rho_l) \, q(y_l \mid \lambda_l)]^{1 - \phi_l}.
\end{aligned}
\tag{4.7}
\]

Figure 4.5: An example graphical model showing the mixture density in Equation (4.6). The
variables $z, \phi_1, \phi_2, \phi_3, \phi_4$ are “hidden” and only $y_1, y_2, y_3, y_4$ are observed.

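The marginal density for $y$ is

\[
\begin{aligned}
p(y) = \sum_{\Phi} p(y, \Phi) &= \sum_{j=1}^{k} \alpha_j \sum_{\Phi} \prod_{l=1}^{d} [\rho_l \, p(y_l \mid \theta_{jl})]^{\phi_l} [(1 - \rho_l) \, q(y_l \mid \lambda_l)]^{1 - \phi_l} \\
&= \sum_{j=1}^{k} \alpha_j \prod_{l=1}^{d} \sum_{\phi_l = 0}^{1} [\rho_l \, p(y_l \mid \theta_{jl})]^{\phi_l} [(1 - \rho_l) \, q(y_l \mid \lambda_l)]^{1 - \phi_l} \\
&= \sum_{j=1}^{k} \alpha_j \prod_{l=1}^{d} \big( \rho_l \, p(y_l \mid \theta_{jl}) + (1 - \rho_l) \, q(y_l \mid \lambda_l) \big),
\end{aligned}
\tag{4.8}
\]

which is just Equation (4.6). Another way to see how Equation (4.6) is obtained is to notice that
the conditional density of $y_l$ given $z = j$ and $\phi_l$, $[p(y_l \mid \theta_{jl})]^{\phi_l} [q(y_l \mid \lambda_l)]^{1 - \phi_l}$, can be written as
$\phi_l \, p(y_l \mid \theta_{jl}) + (1 - \phi_l) \, q(y_l \mid \lambda_l)$, because $\phi_l$ is binary. Taking the expectation with respect to $\phi_l$ and
$z$ leads to Equation (4.6).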

The form of $q(\cdot \mid \cdot)$ reflects our prior knowledge about the distribution of the non-salient features.
In principle, it can be any 1-D distribution (e.g., a Gaussian, a Student-t, or even a mixture). We
shall limit $q(\cdot \mid \cdot)$ to be a Gaussian, since this leads to reasonable results in practice.

Equation (4.6) has a generative interpretation. As in a standard finite mixture, we first select
the component label $j$ by sampling from a multinomial distribution with parameters $(\alpha_1, \ldots, \alpha_k)$.
Then, for each feature $l = 1, \ldots, d$, we flip a biased coin whose probability of getting a head is $\rho_l$: if
we get a head, we use the mixture component $p(\cdot \mid \theta_{jl})$ to generate the $l$-th feature; otherwise, the
common component $q(\cdot \mid \lambda_l)$ is used. A graphical model representation of Equation (4.6) is shown in
Figure 4.5 for the case $d = 4$.
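
The generative interpretation can be made concrete with a short sampling sketch. It assumes Gaussian $p(\cdot \mid \theta_{jl})$ and $q(\cdot \mid \lambda_l)$, as adopted in this chapter; the function name `sample_saliency_mixture` and its argument names are hypothetical and introduced only for illustration.

```python
import numpy as np

def sample_saliency_mixture(n, alpha, rho, mean, var, common_mean, common_var, rng):
    """Draw n points from the feature saliency model of Equation (4.6).

    alpha: (k,) mixing weights; rho: (d,) feature saliencies;
    mean, var: (k, d) Gaussian parameters of p(.|theta_jl);
    common_mean, common_var: (d,) Gaussian parameters of q(.|lambda_l).
    Returns the samples y (n, d), labels z (n,), and indicators phi (n, d).
    """
    k, d = mean.shape
    # Step 1: pick a component label for every point.
    z = rng.choice(k, size=n, p=alpha)
    # Step 2: flip one biased coin per point and feature; True means "salient".
    phi = rng.random((n, d)) < rho[None, :]
    # Candidate values from the component-specific and the common densities.
    from_component = rng.normal(mean[z], np.sqrt(var[z]))
    from_common = rng.normal(common_mean, np.sqrt(common_var), size=(n, d))
    # Step 3: use the component density on heads, the common density on tails.
    y = np.where(phi, from_component, from_common)
    return y, z, phi
```

Setting a saliency to one makes the corresponding feature depend on the cluster label for every point, while a saliency of zero makes it pure background noise.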


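4.3.2.1 EM Algorithm

By treating $Z$ (the hidden class labels) and $\Phi$ (the feature indicators) as hidden variables, one can
derive an EM algorithm for parameter estimation. The complete-data likelihood for the model
in Equation (4.6) is

\[
p(y_i, z_i = j, \Phi) = \alpha_j \prod_{l=1}^{d} [\rho_l \, p(y_{il} \mid \theta_{jl})]^{\phi_l} [(1 - \rho_l) \, q(y_{il} \mid \lambda_l)]^{1 - \phi_l}. \tag{4.9}
\]

Define the following quantities:

\[
w_{ij} = p(z_i = j \mid y_i), \qquad u_{ijl} = p(z_i = j, \phi_l = 1 \mid y_i), \qquad v_{ijl} = p(z_i = j, \phi_l = 0 \mid y_i).
\]

They are calculated using the current parameter estimate $\theta^{\text{now}}$. Note that $u_{ijl} + v_{ijl} = w_{ij}$ and
$\sum_{i,j} w_{ij} = n$. The expected complete data log-likelihood based on $\theta^{\text{now}}$ is

\[
\begin{aligned}
E_{\theta^{\text{now}}}&[\log p(\mathcal{Y}, Z, \Phi)] \\
&= \sum_{i,j,\Phi} p(z_i = j, \Phi \mid y_i) \Big( \log \alpha_j + \sum_{l} \big( \phi_l (\log p(y_{il} \mid \theta_{jl}) + \log \rho_l) \\
&\qquad\qquad + (1 - \phi_l)(\log q(y_{il} \mid \lambda_l) + \log(1 - \rho_l)) \big) \Big) \\
&= \sum_{i,j} p(z_i = j \mid y_i) \log \alpha_j + \sum_{i,j} \sum_{l} \sum_{\phi_l = 0}^{1} p(z_i = j, \phi_l \mid y_i) \big( \phi_l (\log p(y_{il} \mid \theta_{jl}) + \log \rho_l) \\
&\qquad\qquad + (1 - \phi_l)(\log q(y_{il} \mid \lambda_l) + \log(1 - \rho_l)) \big) \\
&= \underbrace{\sum_{j} \Big( \sum_{i} w_{ij} \Big) \log \alpha_j}_{\text{part 1}}
 + \underbrace{\sum_{j,l} \sum_{i} u_{ijl} \log p(y_{il} \mid \theta_{jl})}_{\text{part 2}}
 + \underbrace{\sum_{l} \sum_{i,j} v_{ijl} \log q(y_{il} \mid \lambda_l)}_{\text{part 3}} \\
&\quad + \underbrace{\sum_{l} \Big( \log \rho_l \sum_{i,j} u_{ijl} + \log(1 - \rho_l) \sum_{i,j} v_{ijl} \Big)}_{\text{part 4}}.
\end{aligned}
\]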
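The four parts in the equation above can be maximized separately. Recall that the densities $p(\cdot)$
and $q(\cdot)$ are univariate Gaussian and are characterized by their means and variances. As a result,
maximizing the expected complete data log-likelihood leads to the M-step in Equations (4.18)-(4.23).
For the E-step, observe that

\[
\begin{aligned}
p(\phi_l = 1 \mid z_i = j, y_i) &= \frac{p(\phi_l = 1, y_i \mid z_i = j)}{p(y_i \mid z_i = j)} \\
&= \frac{\rho_l \, p(y_{il} \mid \theta_{jl}) \prod_{l' \neq l} \big( \rho_{l'} \, p(y_{il'} \mid \theta_{jl'}) + (1 - \rho_{l'}) \, q(y_{il'} \mid \lambda_{l'}) \big)}{\prod_{l'} \big( \rho_{l'} \, p(y_{il'} \mid \theta_{jl'}) + (1 - \rho_{l'}) \, q(y_{il'} \mid \lambda_{l'}) \big)} \\
&= \frac{\rho_l \, p(y_{il} \mid \theta_{jl})}{\rho_l \, p(y_{il} \mid \theta_{jl}) + (1 - \rho_l) \, q(y_{il} \mid \lambda_l)} = \frac{a_{ijl}}{c_{ijl}}.
\end{aligned}
\]

Therefore, equation (4.16) follows because

\[
u_{ijl} = p(\phi_l = 1 \mid z_i = j, y_i) \, p(z_i = j \mid y_i) = \frac{a_{ijl}}{c_{ijl}} \, w_{ij}. \tag{4.11}
\]

So, the EM algorithm is

• E-step: Compute the following quantities:

\[
\begin{aligned}
a_{ijl} &= p(\phi_l = 1, y_{il} \mid z_i = j) = \rho_l \, p(y_{il} \mid \theta_{jl}) && (4.12) \\
b_{ijl} &= p(\phi_l = 0, y_{il} \mid z_i = j) = (1 - \rho_l) \, q(y_{il} \mid \lambda_l) && (4.13) \\
c_{ijl} &= p(y_{il} \mid z_i = j) = a_{ijl} + b_{ijl} && (4.14) \\
w_{ij} &= p(z_i = j \mid y_i) = \frac{\alpha_j \prod_{l} c_{ijl}}{\sum_{j'} \alpha_{j'} \prod_{l} c_{ij'l}} && (4.15) \\
u_{ijl} &= p(\phi_l = 1, z_i = j \mid y_i) = \frac{a_{ijl}}{c_{ijl}} \, w_{ij} && (4.16) \\
v_{ijl} &= p(\phi_l = 0, z_i = j \mid y_i) = w_{ij} - u_{ijl} && (4.17)
\end{aligned}
\]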

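• M-step: Re-estimate the parameters according to the following expressions:

\[
\begin{aligned}
\hat{\alpha}_j &= \frac{\sum_{i} w_{ij}}{\sum_{i,j} w_{ij}} = \frac{\sum_{i} w_{ij}}{n} && (4.18) \\
\widehat{\text{Mean in } \theta_{jl}} &= \frac{\sum_{i} u_{ijl} \, y_{il}}{\sum_{i} u_{ijl}} && (4.19) \\
\widehat{\text{Var in } \theta_{jl}} &= \frac{\sum_{i} u_{ijl} \, \big( y_{il} - (\widehat{\text{Mean in } \theta_{jl}}) \big)^2}{\sum_{i} u_{ijl}} && (4.20) \\
\widehat{\text{Mean in } \lambda_l} &= \frac{\sum_{i} \big( \sum_{j} v_{ijl} \big) \, y_{il}}{\sum_{i,j} v_{ijl}} && (4.21) \\
\widehat{\text{Var in } \lambda_l} &= \frac{\sum_{i} \big( \sum_{j} v_{ijl} \big) \big( y_{il} - (\widehat{\text{Mean in } \lambda_l}) \big)^2}{\sum_{i,j} v_{ijl}} && (4.22) \\
\hat{\rho}_l &= \frac{\sum_{i,j} u_{ijl}}{\sum_{i,j} u_{ijl} + \sum_{i,j} v_{ijl}} = \frac{\sum_{i,j} u_{ijl}}{n} && (4.23)
\end{aligned}
\]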


In these equations, the variable $u_{ijl}$ measures how important the $i$-th pattern is to the $j$-th
component, when the $l$-th feature is used. It is thus natural that the estimates of the mean and the
variance in $\theta_{jl}$ are weighted sums with weight $u_{ijl}$. A similar relationship exists between $\sum_j v_{ijl}$
and $\lambda_l$. The term $\sum_{i,j} u_{ijl}$ can be interpreted as how likely it is that $\phi_l$ equals one, explaining why
the estimate of $\rho_l$ is proportional to $\sum_{i,j} u_{ijl}$.
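
One iteration of the E-step (4.12)-(4.17) and M-step (4.18)-(4.23) can be sketched with NumPy as follows. This is an illustrative sketch rather than the code used in the experiments: the function names and array layout are assumptions, the product over features is computed directly (a log-domain version would be safer for large $d$), and no guard is included for saliencies that reach exactly zero or one.

```python
import numpy as np

def gaussian_pdf(y, mean, var):
    """Univariate Gaussian density, evaluated elementwise with broadcasting."""
    return np.exp(-0.5 * (y - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def e_step(Y, alpha, rho, mean, var, cmean, cvar):
    """Shapes: Y (n, d); alpha (k,); rho (d,); mean, var (k, d); cmean, cvar (d,)."""
    a = rho * gaussian_pdf(Y[:, None, :], mean, var)           # (4.12), shape (n, k, d)
    b = ((1.0 - rho) * gaussian_pdf(Y, cmean, cvar))[:, None, :]  # (4.13), shape (n, 1, d)
    c = a + b                                                  # (4.14)
    w = alpha * np.prod(c, axis=2)                             # numerator of (4.15)
    w = w / w.sum(axis=1, keepdims=True)                       # (4.15), shape (n, k)
    u = (a / c) * w[:, :, None]                                # (4.16)
    v = w[:, :, None] - u                                      # (4.17)
    return w, u, v

def m_step(Y, w, u, v):
    """Re-estimate all parameters from the E-step quantities."""
    n = Y.shape[0]
    alpha = w.sum(axis=0) / n                                  # (4.18)
    su = u.sum(axis=0)                                         # sum over i, shape (k, d)
    mean = (u * Y[:, None, :]).sum(axis=0) / su                # (4.19)
    var = (u * (Y[:, None, :] - mean) ** 2).sum(axis=0) / su   # (4.20)
    vl = v.sum(axis=1)                                         # sum over j, shape (n, d)
    sv = vl.sum(axis=0)                                        # sum over i and j, shape (d,)
    cmean = (vl * Y).sum(axis=0) / sv                          # (4.21)
    cvar = (vl * (Y - cmean) ** 2).sum(axis=0) / sv            # (4.22)
    rho = su.sum(axis=0) / n                                   # (4.23)
    return alpha, rho, mean, var, cmean, cvar
```

Because $u_{ijl} + v_{ijl} = w_{ij}$ and each row of $w$ sums to one, the denominator in (4.23) equals $n$, which is why `rho` can be obtained by dividing by `n` directly.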


4.3.3  Model  Selection 

Standard  EM  for  mixtures  exhibits  some  weaknesses,  which  also  affect  the  EM  algorithm  introduced 
above:  it  requires  knowledge  of  k  (the  number  of  mixture  components),  and  a  good  initialization 
is  essential  for  reaching  a  good  local  optimum  (not  to  mention  the  global  optimum).  To  overcome 
these  difficulties,  we  adopt  the  approach  in  [81],  which  is  based  on  the  minimum  message  length 
(MML)  criterion  [265,  264]. 

The MML criterion for our model consists of minimizing, with respect to $\theta$, the following cost
function (after discarding the order one term)

\[
-\log p(\mathcal{Y} \mid \theta) + \frac{k + d}{2} \log n + \frac{r}{2} \sum_{l=1}^{d} \sum_{j=1}^{k} \log(n \alpha_j \rho_l) + \frac{s}{2} \sum_{l=1}^{d} \log(n (1 - \rho_l)), \tag{4.24}
\]

where $r$ and $s$ are the number of parameters in $\theta_{jl}$ and $\lambda_l$, respectively. If $p(\cdot \mid \cdot)$ and $q(\cdot \mid \cdot)$ are
univariate Gaussians (arbitrary mean and variance), $r = s = 2$. Equation (4.24) is derived by
considering the minimum message length (MML) criterion (see [81] for details and references)

\[
\hat{\theta} = \arg\min_{\theta} \Big\{ -\log p(\theta) - \log p(\mathcal{Y} \mid \theta) + \frac{1}{2} \log |I(\theta)| + \frac{c}{2} \Big( 1 + \log \frac{1}{12} \Big) \Big\}, \tag{4.25}
\]

where $\theta$ is the set of parameters of the model, $c$ is the dimension of $\theta$, $I(\theta) = -E[D^2_{\theta} \log p(\mathcal{Y} \mid \theta)]$
is the (expected) Fisher information matrix (the negative expected value of the Hessian of the
log-likelihood), and $|I(\theta)|$ is the determinant of $I(\theta)$. The information matrix for the model (4.6) is very
difficult to obtain analytically. Therefore, as in [81], we approximate it by the information matrix of
the complete data log-likelihood, $I_c(\theta)$. By differentiating the logarithm of equation (4.9), we can
show that


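$$I_c(\theta) = \text{block-diag}\Bigl[M,\ \frac{1}{\rho_1(1-\rho_1)},\ldots,\frac{1}{\rho_d(1-\rho_d)},\ \alpha_1\rho_1 I(\theta_{11}),\ldots,\alpha_1\rho_d I(\theta_{1d}),\ \alpha_2\rho_1 I(\theta_{21}),\ldots,\alpha_2\rho_d I(\theta_{2d}),\ \ldots,\ \alpha_k\rho_1 I(\theta_{k1}),\ldots,\alpha_k\rho_d I(\theta_{kd}),\ (1-\rho_1)I(\lambda_1),\ldots,(1-\rho_d)I(\lambda_d)\Bigr], \tag{4.26}$$

where $M$ is the information matrix of the multinomial distribution with parameters $(\alpha_1,\ldots,\alpha_k)$.
The size of $I(\theta)$ is $(k + d + kdr + ds)$, where $r$ and $s$ are the number of parameters in $\theta_{jl}$ and $\lambda_l$,
respectively. Note that $(\rho_l(1-\rho_l))^{-1}$ is the Fisher information of a Bernoulli distribution with
parameter $\rho_l$. Thus we can write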


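$$\log|I_c(\theta)| = \log I(\{\alpha_j\}) + \sum_{l=1}^{d}\log I(\rho_l) + r\sum_{j=1}^{k}\sum_{l=1}^{d}\log(\alpha_j\rho_l) + \sum_{j=1}^{k}\sum_{l=1}^{d}\log|I(\theta_{jl})| + s\sum_{l=1}^{d}\log(1-\rho_l) + \sum_{l=1}^{d}\log|I(\lambda_l)|. \tag{4.27}$$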


For the prior densities of the parameters, we assume that different groups of parameters are
independent. Specifically, $\{\alpha_j\}$, $\rho_l$ (for different values of $l$), $\theta_{jl}$ (for different values of $j$ and $l$) and
$\lambda_l$ (for different values of $l$) are independent. Furthermore, since we have no knowledge about the
parameters, we adopt non-informative Jeffreys' priors (see [81] for details and references), which are
proportional to the square root of the determinant of the corresponding information matrices. When
we substitute $p(\theta)$ and $|I(\theta)|$ into Equation (4.25), and drop the order-one term, we obtain our final
criterion, which is Equation (4.24).


From a parameter estimation viewpoint, Equation (4.24) is equivalent to a maximum a posteriori
(MAP) estimate,

$$\hat{\theta} = \arg\max_{\theta}\Bigl\{\log p(\mathcal{Y}\mid\theta) - \frac{rd}{2}\sum_{j=1}^{k}\log\alpha_j - \frac{s}{2}\sum_{l=1}^{d}\log(1-\rho_l) - \frac{rk}{2}\sum_{l=1}^{d}\log\rho_l\Bigr\}, \tag{4.28}$$

with the following (Dirichlet-type, but improper) priors on the $\alpha_j$'s and $\rho_l$'s:

$$p(\alpha_1,\ldots,\alpha_k) \propto \prod_{j=1}^{k}\alpha_j^{-rd/2}, \qquad p(\rho_1,\ldots,\rho_d) \propto \prod_{l=1}^{d}\rho_l^{-rk/2}(1-\rho_l)^{-s/2}.$$

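Since these priors are conjugate with respect to the complete data likelihood, the EM algorithm
undergoes a minor modification: the M-step Equations (4.18) and (4.23) are replaced by

$$\hat{\alpha}_j = \frac{\max\bigl(\sum_i w_{ij} - \frac{rd}{2},\ 0\bigr)}{\sum_j \max\bigl(\sum_i w_{ij} - \frac{rd}{2},\ 0\bigr)}, \tag{4.29}$$

$$\hat{\rho}_l = \frac{\max\bigl(\sum_{i,j} u_{ijl} - \frac{kr}{2},\ 0\bigr)}{\max\bigl(\sum_{i,j} u_{ijl} - \frac{kr}{2},\ 0\bigr) + \max\bigl(\sum_{i,j} v_{ijl} - \frac{s}{2},\ 0\bigr)}. \tag{4.30}$$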


In addition to the log-likelihood, the other terms in Equation (4.24) have simple interpretations.
The term $\frac{k+d}{2}\log n$ is a standard MDL-type [215] parameter code-length corresponding to $k$ $\alpha_j$
values and $d$ $\rho_l$ values. For the $l$-th feature in the $j$-th component, the "effective" number of data
points for estimating $\theta_{jl}$ is $n\alpha_j\rho_l$. Since there are $r$ parameters in each $\theta_{jl}$, the corresponding
code-length is $\frac{r}{2}\log(n\rho_l\alpha_j)$. Similarly, for the $l$-th feature in the common component, the number
of effective data points for estimation is $n(1-\rho_l)$. Thus, there is a term $\frac{s}{2}\log(n(1-\rho_l))$ in (4.24)
for each feature.
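The terms just described can be added up directly. The following is a minimal sketch of the cost in Equation (4.24), assuming the log-likelihood has already been computed elsewhere and that every $\alpha_j$ is positive and every $\rho_l$ lies strictly between 0 and 1 (pruned parameters are assumed to have been removed beforehand). The function name `message_length` and its argument names are illustrative choices, not part of the original algorithm description.

```python
import numpy as np

def message_length(log_lik, alpha, rho, n, r=2, s=2):
    """Sketch of the cost in Equation (4.24).

    log_lik : log p(Y | theta), computed elsewhere
    alpha   : mixing weights of the k surviving components (all > 0)
    rho     : saliencies of the d surviving features (all in (0, 1))
    n       : number of data points
    r, s    : parameters per component density / common density
    """
    alpha = np.asarray(alpha, dtype=float)
    rho = np.asarray(rho, dtype=float)
    k, d = alpha.size, rho.size
    # Negative log-likelihood plus the MDL-type term for the alpha_j and rho_l.
    cost = -log_lik + 0.5 * (k + d) * np.log(n)
    # (r/2) * sum_l sum_j log(n * alpha_j * rho_l): code-length of each theta_jl.
    cost += 0.5 * r * np.sum(np.log(n * np.outer(alpha, rho)))
    # (s/2) * sum_l log(n * (1 - rho_l)): code-length of each lambda_l.
    cost += 0.5 * s * np.sum(np.log(n * (1.0 - rho)))
    return float(cost)
```

With a single component, a single feature of saliency 0.5, $n = 4$ and a zero log-likelihood, the three penalty terms are $\log 4$, $\log 2$ and $\log 2$, so the sketch returns $\log 16$.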

One key property of Equations (4.29) and (4.30) is their pruning behavior, forcing some of the
$\alpha_j$ to go to zero and some of the $\rho_l$ to go to zero or one. This pruning behavior also has the indirect
benefit of protecting us from an almost singular covariance matrix in a mixture component: the weight
of such a component is usually very small, and the component is likely to be pruned in the next few
iterations. Concerns that the message length in (4.24) may become invalid at these boundary values
can be circumvented by the arguments in [81]: when $\rho_l$ goes to zero, the $l$-th feature is no longer
salient and $\rho_l$ and $\theta_{1l},\ldots,\theta_{kl}$ are removed; when $\rho_l$ goes to 1, $\lambda_l$ and $\rho_l$ are dropped.
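The pruning behavior can be seen in a small sketch of the two modified M-step updates, Equations (4.29) and (4.30). It assumes the E-step quantities $w_{ij}$, $u_{ijl}$ and $v_{ijl}$ are available as NumPy arrays; the function name `update_alpha_rho` and the array layout are assumptions made for illustration. The sketch does not guard against the degenerate case where every numerator is clipped to zero.

```python
import numpy as np

def update_alpha_rho(w, u, v, r=2, s=2):
    """Sketch of Equations (4.29) and (4.30).

    w : (n, k) array of w_ij
    u : (n, k, d) array of u_ijl
    v : (n, k, d) array of v_ijl
    """
    k = w.shape[1]
    d = u.shape[2]
    # Equation (4.29): a component whose total weight is below r*d/2 is pruned.
    a = np.maximum(w.sum(axis=0) - 0.5 * r * d, 0.0)
    alpha = a / a.sum()
    # Equation (4.30): rho_l is clipped to 0 or 1 when one side vanishes.
    salient = np.maximum(u.sum(axis=(0, 1)) - 0.5 * k * r, 0.0)
    common = np.maximum(v.sum(axis=(0, 1)) - 0.5 * s, 0.0)
    rho = salient / (salient + common)
    return alpha, rho
```

In this sketch a component supported by fewer than $rd/2$ "effective" points receives weight exactly zero, and a feature whose salient support is below $kr/2$ receives saliency exactly zero, which is the pruning described above.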

Finally, since the model selection algorithm determines the number of components, it can be
initialized with a large value of $k$, thus alleviating the need for a good initialization, as shown in
[81]. Because of this, a component-wise version of EM can be adopted [37, 81]. The algorithm is
summarized in Algorithm 4.1.


Input: Training data $\mathcal{Y} = \{\mathbf{y}_1,\ldots,\mathbf{y}_n\}$, minimum number of components $k_{\min}$
Output: Number of components $k$, mixture parameters $\{\theta_{jl}\}$, $\{\alpha_j\}$, parameters of common
        distribution $\{\lambda_l\}$ and feature saliencies $\{\rho_l\}$

{Initialization}
Set the parameters of a large number of mixture components randomly
Set the common distribution to cover all data
Set the feature saliency of all features to 0.5
{Initialization ends; main loop begins}
while $k > k_{\min}$ do
    while not reach local minimum do
        Perform E-step according to Equations (4.12) to (4.17)
        Perform M-step according to Equations (4.19) to (4.22), (4.29) and (4.30)
        If $\alpha_j$ becomes zero, the $j$-th component is pruned.
        If $\rho_l$ becomes 1, $q(y_l\mid\lambda_l)$ is pruned.
        If $\rho_l$ becomes 0, $p(y_l\mid\theta_{jl})$ are pruned for all $j$
    end while
    Record the current model parameters and its message length
    Remove the component with the smallest weight
end while
Return the model parameters that yield the smallest message length

Algorithm 4.1: The unsupervised feature saliency algorithm.


4.3.4  Post-processing of Feature Saliency

The feature saliencies generated by Algorithm 4.1 attempt to find the best way to model the data,
using different component densities. Alternatively, we can consider feature saliencies that best
discriminate between different components. This can be more appropriate if the ultimate goal is to
discover well-separated clusters. If the components are well-separated, each pattern is likely to be
generated by one component only. Therefore, one quantitative measure of the separability of the
clusters is

$$J = \sum_{i=1}^{n}\log P(z_i = t_i\mid\mathbf{y}_i), \tag{4.31}$$

where $t_i = \arg\max_j P(z_i = j\mid\mathbf{y}_i)$. Intuitively, $J$ is the sum of the logarithms of the posterior
probabilities of the data, assuming that each data point was indeed generated by the component
with maximum posterior probability (an implicit assumption in mixture-based clustering). $J$ can
then be maximized by varying $\rho_l$ while keeping the other parameters fixed.
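As a concrete reading of Equation (4.31), the sketch below evaluates $J$ from a matrix of posterior probabilities. It assumes the posteriors $P(z_i = j \mid \mathbf{y}_i)$ have already been computed and stored row-wise; the function name `separability` is a hypothetical label for illustration.

```python
import numpy as np

def separability(w):
    """J of Equation (4.31): sum over points of the log of the largest posterior.

    w : (n, k) array whose i-th row holds P(z_i = j | y_i) for j = 1..k
    """
    w = np.asarray(w, dtype=float)
    # Taking the row maximum is the same as evaluating the posterior at t_i.
    return float(np.sum(np.log(w.max(axis=1))))
```

$J$ is zero when every point is assigned to one component with certainty and becomes more negative as the posteriors become more evenly spread.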

Unlike  the  MML  criterion,  J  cannot  be  optimized  by  an  EM  algorithm.  However,  by  defining 


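$$h_{ilj} = \frac{p(y_{il}\mid\theta_{jl}) - q(y_{il}\mid\lambda_l)}{\rho_l\,p(y_{il}\mid\theta_{jl}) + (1-\rho_l)\,q(y_{il}\mid\lambda_l)}, \qquad g_{il} = \sum_{j=1}^{k} w_{ij}h_{ilj},$$

we can show that

$$\frac{\partial}{\partial\rho_l}\log w_{ij} = h_{ilj} - g_{il},$$

$$\frac{\partial^2}{\partial\rho_l\,\partial\rho_m}\log w_{ij} = g_{il}g_{im} - \sum_{j=1}^{k} w_{ij}h_{ilj}h_{imj}, \quad \text{for } l \neq m,$$

$$\frac{\partial^2}{\partial\rho_l^2}\log w_{ij} = g_{il}^2 - h_{ilj}^2.$$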


The gradient and Hessian of $J$ can then be calculated accordingly, by summing these expressions
over $i$ with $j = t_i$, if we ignore the dependence of $t_i$ on $\rho_l$. We can then use any constrained
non-linear optimization software to find the optimal values of $\rho_l$ in [0, 1]. We have used the MATLAB
optimization toolbox in our experiments. After obtaining the set of optimized $\rho_l$, we fix them and
estimate the remaining parameters using the EM algorithm.
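To make the gradient computation concrete, here is a minimal sketch that evaluates $J$ and its gradient with respect to the saliencies, using $\partial \log w_{ij} / \partial\rho_l = h_{ilj} - g_{il}$ with $h_{ilj} = (p - q)/(\rho_l p + (1-\rho_l) q)$ and $g_{il} = \sum_j w_{ij} h_{ilj}$. It assumes the component densities $p(y_{il}\mid\theta_{jl})$ and common densities $q(y_{il}\mid\lambda_l)$ have been evaluated beforehand and are passed in as arrays; the function names and array layout are illustrative assumptions, and the plain product over features would need a log-domain version for large $d$.

```python
import numpy as np

def gauss(y, mean, var):
    """Univariate Gaussian density, used here only to build example inputs."""
    return np.exp(-0.5 * (y - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def J_and_gradient(rho, p, q, alpha):
    """Sketch of J (Equation (4.31)) and its gradient with respect to rho.

    rho   : (d,) feature saliencies
    p     : (n, k, d) values p(y_il | theta_jl)
    q     : (n, d) values q(y_il | lambda_l)
    alpha : (k,) mixing weights
    """
    mix = rho * p + (1.0 - rho) * q[:, None, :]        # per-feature mixture, (n, k, d)
    joint = alpha * np.prod(mix, axis=2)               # alpha_j * prod_l mix, (n, k)
    w = joint / joint.sum(axis=1, keepdims=True)       # posteriors w_ij
    t = w.argmax(axis=1)                               # t_i, treated as fixed
    rows = np.arange(w.shape[0])
    h = (p - q[:, None, :]) / mix                      # h_ilj, stored as (n, k, d)
    g = np.einsum("ij,ijl->il", w, h)                  # g_il
    J = float(np.sum(np.log(w[rows, t])))
    grad = np.sum(h[rows, t, :] - g, axis=0)           # sum_i (h_{i,l,t_i} - g_il)
    return J, grad
```

The returned gradient could be handed to a bound-constrained optimizer with bounds [0, 1] on each $\rho_l$; which optimizer is used is left open here.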


4.4  Experimental  Results 


4.4.1  Synthetic  Data 

The first synthetic data set consisted of 800 data points from a mixture of four equiprobable
Gaussians $\mathcal{N}(\mathbf{m}_i, \mathbf{I})$, $i \in \{1, 2, 3, 4\}$, with two-dimensional mean vectors $\mathbf{m}_1, \ldots, \mathbf{m}_4$, of which
$\mathbf{m}_4 = (7, 10)^T$ (Figure 4.6(a)). Eight "noisy" features (sampled from a $\mathcal{N}(0, 1)$ density) were then
appended to this data, yielding a set of 800 10-dimensional patterns. We ran the proposed algorithm
10 times, each initialized with $k = 30$; the common component was initialized to cover the entire set
of data, and the feature saliency values were initialized at 0.5. A local minimum was reached if the
change in description length between two iterations was less than $10^{-7}$. A typical run of the
algorithm is shown in Figure 4.6. In all the ten random runs with this mixture, the four components
were always correctly identified. The saliencies of all the ten features, together with their standard
deviations (error bars), are shown in Figure 4.7(a). We can conclude that, in this case, the algorithm
successfully locates the true clusters and correctly assigns the feature saliencies.
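A data set of this kind can be generated along the following lines. This is only a sketch: the four mean vectors in the code are assumed values chosen for illustration, and the function name, seed and sampling scheme are likewise assumptions rather than the exact procedure used in the experiments.

```python
import numpy as np

def four_gaussians_with_noise(n=800, n_noise=8, seed=0):
    """Two informative features from four unit-covariance Gaussians, plus noise features."""
    rng = np.random.default_rng(seed)
    # Assumed (illustrative) mean vectors of the four components.
    means = np.array([[0.0, 3.0], [1.0, 9.0], [6.0, 4.0], [7.0, 10.0]])
    # Equiprobable components: each point picks its component uniformly at random.
    labels = rng.integers(0, 4, size=n)
    informative = means[labels] + rng.standard_normal((n, 2))
    # Noise features are N(0, 1) and independent of the component label.
    noise = rng.standard_normal((n, n_noise))
    return np.hstack([informative, noise]), labels
```

Because the eight appended columns do not depend on the component label, a well-behaved saliency estimate should be high for the first two features and low for the rest.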

In the second experiment, we considered the Trunk data [122, 252], consisting of two 20-
dimensional Gaussians $\mathcal{N}(\mathbf{m}_1, \mathbf{I})$ and $\mathcal{N}(\mathbf{m}_2, \mathbf{I})$, where $\mathbf{m}_1 = (1, \frac{1}{\sqrt{2}}, \ldots, \frac{1}{\sqrt{20}})$, $\mathbf{m}_2 = -\mathbf{m}_1$. Data
were obtained by sampling 5000 points from each of these two Gaussians. Note that the features are
arranged in descending order of relevance. As above, the stopping threshold was set to $10^{-7}$ and
the initial value of $k$ was set to 30. In all the 10 runs performed, the two components were always
detected. The feature saliencies are shown in Figure 4.7(b). The lower the feature number, the more
important the feature. We can see the general trend that as the feature number increases, the
saliency decreases, in accordance with the true characteristics of the data.
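The Trunk data can be sketched in a few lines, which also shows why relevance decreases with the feature number: the two class means differ by $2/\sqrt{l}$ in feature $l$ while the variance stays at one. The function name and seed below are illustrative assumptions.

```python
import numpy as np

def trunk_data(n_per_class=5000, d=20, seed=0):
    """Two d-dimensional unit-covariance Gaussians with means m1 and -m1."""
    rng = np.random.default_rng(seed)
    # l-th entry of m1 is 1 / sqrt(l), for l = 1..d.
    m1 = 1.0 / np.sqrt(np.arange(1, d + 1))
    x1 = m1 + rng.standard_normal((n_per_class, d))
    x2 = -m1 + rng.standard_normal((n_per_class, d))
    X = np.vstack([x1, x2])
    labels = np.repeat([0, 1], n_per_class)
    return X, labels, m1
```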


4.4.2  Real  data 

We tested our algorithm on several data sets with different characteristics (Table 4.1). The wine
recognition data set (wine) contains results of chemical analysis of wines from different cultivars.
The goal is to predict the type of a wine based on its chemical composition; it has 178 data points, 13
features, and 3 classes. The Wisconsin diagnostic breast cancer data set (wdbc) was used to obtain
a binary diagnosis (benign or malignant) based on 30 features extracted from cell nuclei presented
in an image; it has 569 data points. The image segmentation data set (image) contains 2320 data
points with 19 features from seven classes; each pattern consists of features extracted from a 3 x 3
region taken from 7 types of outdoor images: brickface, sky, foliage, cement, window, path, and
grass. The texture data set (texture) consists of 4000 19-dimensional Gabor filter features from a
collage of four Brodatz textures [127]. A data set (zernike) of 47 Zernike moments extracted from
images of handwritten numerals (as in [126]) is also used; there are 200 images for each digit, totaling
2000 patterns. The data sets wine, wdbc, image, and zernike are from the UCI machine learning
repository (http://www.ics.uci.edu/~mlearn/MLSummary.html). This repository has been
extensively used in pattern recognition and machine learning studies. Normalization to zero mean
and unit variance is performed for all but the texture data set, so as to make the contribution of
different features roughly equal a priori. We do not normalize the texture data set because it is
already approximately normalized. Since these data sets were collected for supervised classification,
the class labels are not involved in our experiment, except for the evaluation of clustering results.
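The normalization step can be sketched as follows. The sketch standardizes each feature to zero mean and unit variance and drops constant features, which cannot be rescaled; the function name `standardize` and the choice to compute the statistics on the array passed in are assumptions made for illustration.

```python
import numpy as np

def standardize(X):
    """Scale each non-constant feature to zero mean and unit variance."""
    X = np.asarray(X, dtype=float)
    std = X.std(axis=0)
    # A constant feature has zero variance and is discarded.
    keep = std > 0
    Z = (X[:, keep] - X[:, keep].mean(axis=0)) / std[keep]
    return Z, keep
```

The boolean mask `keep` records which original features survive, so that saliencies can later be mapped back to the original feature numbers.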

Table 4.1: Real world data sets used in the experiment. Each data set has n data points with d
features from c classes. One feature with a constant value in image is discarded. Normalization is
not needed for texture because the features have comparable variances.

Abbr.     Full name                            n      d    c    Normalized?
wine      wine recognition                     178    13   3    yes
wdbc      Wisconsin diagnostic breast cancer   569    30   2    yes
image     image segmentation                   2320   18   7    yes
texture   Texture data set                     4000   19   4    no
zernike   Zernike moments of digit images      2000   47   10   yes

Table 4.2: Results of the algorithm over 20 random data splits and algorithm initializations. "Error"
corresponds to the mean of the error rates on the testing set when the clustering results are compared
with the ground truth labels. $\hat{c}$ denotes the number of Gaussian components estimated. Note that the
post-processing does not change the number of Gaussian components. The numbers in parentheses
are the standard deviations of the corresponding quantities.

          Algorithm 4.1                  With post-processing   Using all the features
          error (in %)    $\hat{c}$      error (in %)           error (in %)    $\hat{c}$
wine      6.61 (3.91)     3.1 (0.31)     6.61 (3.23)            8.06 (3.73)     3 (0)
wdbc      9.55 (1.99)     5.65 (0.75)    9.35 (2.07)            10.09 (2.00)    2.70 (0.57)
image     20.19 (1.54)    23.1 (1.74)    20.28 (1.60)           32.84 (5.1)     13.8 (1.94)
texture   4.04 (0.76)     36.17 (1.19)   4.02 (0.74)            4.85 (0.98)     31.42 (2.81)
zernike   52.09 (2.52)    11.3 (0.98)    51.99 (2.32)           56.42 (3.62)    10 (0)

Each data set was first randomly divided into two halves: one for training, another for testing.

Algorithm 4.1 was run on the training set. The feature saliency values can be refined as described in
Section 4.3.4. We evaluated the results by interpreting the fitted Gaussian components as clusters
and compared them with the ground truth labels. Each data point in the test set was assigned to
the component that most likely generated it, and the pattern was classified to the class represented
by the component. We then computed the error rates on the test data. For comparison, we also ran
the mixture-of-Gaussians algorithm in [81] using all the features, with the number of classes of the
data set as a lower bound on the number of components. This gives us a fair ground for comparing
Gaussian mixtures with and without feature saliency. In order to ensure that we had enough data
with respect to the number of features for the algorithm in [81], the covariance matrices of the
mixture components were restricted to be diagonal, but were different for different components.
The entire procedure was repeated 20 times with different splits of data and initializations of the
algorithm. The results are shown in Table 4.2. We also show the feature saliency values of different
features in different runs as gray-level image maps in Figure 4.9. For illustrative purposes, we contrast
the clusters obtained for the image data set with the true class labels in Figure 4.8, after using PCA
to project the data into 3D.
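The evaluation protocol can be sketched as below. The sketch assumes that the class "represented by" a component is the majority class among the training points assigned to that component, and that class labels are coded as non-negative integers; both are assumptions for illustration, as is the function name `clustering_error`.

```python
import numpy as np

def clustering_error(train_comp, train_labels, test_comp, test_labels):
    """Error rate on the test set after mapping each component to a class.

    train_comp, test_comp     : most likely component index of each point
    train_labels, test_labels : integer ground-truth class labels (0, 1, ...)
    """
    train_comp = np.asarray(train_comp)
    train_labels = np.asarray(train_labels)
    test_comp = np.asarray(test_comp)
    test_labels = np.asarray(test_labels)
    n_classes = int(max(train_labels.max(), test_labels.max())) + 1
    # Components never seen in training fall back to the overall majority class.
    fallback = int(np.bincount(train_labels, minlength=n_classes).argmax())
    comp_to_class = {}
    for comp in np.unique(train_comp):
        counts = np.bincount(train_labels[train_comp == comp], minlength=n_classes)
        comp_to_class[int(comp)] = int(counts.argmax())
    predicted = np.array([comp_to_class.get(int(c), fallback) for c in test_comp])
    return float(np.mean(predicted != test_labels))
```

Note that the class labels enter only through this mapping and the final comparison; they play no role in fitting the mixture.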

From Table 4.2, we can see that the proposed algorithm reduces the error rates when compared
with using all the features for all five data sets. The improvement is more significant for the image
data set, but this may be due to the increased number of components estimated. The high error
rate for zernike is due to the fact that digit images are inherently more difficult to cluster: for
example, "4" can be written in a manner very similar to "9", and it is difficult for any unsupervised
learning algorithm to distinguish between them. The post-processing can increase the "contrast" of
the feature saliencies, as the image maps in Figure 4.9 show, without deteriorating the accuracy. It
is easier to perform "hard" feature selection using these post-processed feature saliencies, if this is
required for the application.


4.5  Discussion


4.5.1  Complexity 

The major computational load in the proposed algorithm is in the E-step and the M-step. Each
E-step iteration computes O(ndk) quantities. As each quantity can be computed in constant time,
the time complexity for the E-step is also O(ndk). Similarly, the M-step takes O(ndk) time. The total
amount of computation depends on the number of iterations required for convergence.

At first sight, the amount of computation seems to be demanding. However, a close examination
reveals that each iteration (E-step and M-step) of the standard EM algorithm also takes O(ndk)
time. The value of k in the standard EM, though, is usually smaller, because the proposed algorithm
starts with a larger number of components. The number of iterations required for our algorithm
is also in general larger because of the increase in the number of parameters. Therefore, it is true
that the proposed algorithm takes more time than the standard EM algorithm with one parameter
setting. However, the proposed algorithm can determine the number of clusters as well as the feature
subset. If we want to achieve the same goal with the standard EM algorithm using a wrapper
approach, we need to re-run EM multiple times with a different number of components and different
feature subsets. The computational demand is much heavier than that of the proposed algorithm, even
with a heuristic search to guide the selection of feature subsets. Another strength of the proposed
algorithm is that by initializing with a large number of Gaussian components, the algorithm is less
sensitive to the local minimum problem than the standard EM algorithm. We can further reduce the
complexity by adopting optimization techniques applicable to standard EM for Gaussian mixtures,
such as sampling the data, compressing the data [28], or using efficient data structures [203, 224].

For the post-processing step in Section 4.3.4, each computation of the quantity J and its gradient
and Hessian takes O(ndk) time. The number of iterations is difficult to predict, as it depends on
the optimization routine. However, we can always put an upper bound on the number of iterations
and trade speed for the optimality of the results.

4.5.2  Relation  to  Shrinkage  Estimate 

One interpretation of Equation (4.6) is that we "regularize" the distribution of each feature in
different components by the common distribution. This is analogous to the shrinkage estimator for
covariance matrices of class-conditional densities [68], which is a weighted sum of an estimate of the
class-specific covariance matrix, and the "global" covariance matrix estimate. In Equation (4.6), the
pdf of the $l$-th feature is also a weighted sum of a component-specific pdf and a common density. An
important difference here is that the weight $\rho_l$ is estimated from the data, using the MML principle,
instead of being set heuristically, as is commonly done. As shrinkage estimators have found empirical
success in combating data scarcity, this "regularization" viewpoint is an alternative explanation for the
usefulness of the proposed algorithm.

4.5.3  Limitation  of  the  Proposed  Algorithm 

A limitation of the proposed algorithm is the feature independence assumption (conditioned on
the mixture component). While, empirically, the violation of the independence assumption usually
does not affect the accuracy of a classifier (as in supervised learning) or the quality of clusters
(as in unsupervised learning), this has some negative influence on the feature selection problem.
Specifically, a feature that is redundant because its distribution is independent of the component
label given another feature cannot be modelled under the feature independence assumption. As a
result, both features are kept. This explains why, in general, the feature saliencies are somewhat
high. The post-processing in Section 4.3.4 can cope with this problem because it considers the
posterior distribution and therefore can discard features that do not help in identifying the clusters
directly.


4.5.4  Extension  to  Semi-supervised  Learning 

Sometimes, we may have some knowledge of the class labels of different Gaussian components. This
can happen when, say, we adopt a procedure to combine different Gaussian components to form
a cluster (e.g., as in [216]), or in a semi-supervised learning scenario, where we can use a small
amount of labelled data to help us identify which Gaussian component belongs to which class. This
additional information can suggest combining several Gaussian components to form a single
class/cluster, thereby allowing the identification of non-Gaussian clusters. The post-processing step
can take advantage of this information.

Suppose we know there are $C$ classes and the posterior probability that pattern $\mathbf{y}_i$ belongs to the
$c$-th class, denoted $r_{ic}$, can be computed as $r_{ic} = \sum_{j=1}^{k}\beta_{cj}P(z_i = j\mid\mathbf{y}_i)$. For example, if we know
that the components 4, 6, and 10 are from class 2, we can set $\beta_{2,4} = \beta_{2,6} = \beta_{2,10} = 1/3$ and the
other $\beta_{2,j}$ to be zero. The post-processing is modified accordingly: redefine $t_i$ in Equation (4.31)
to $t_i = \arg\max_c r_{ic}$, i.e., it becomes the class label for $\mathbf{y}_i$ in view of the extra information; replace
$\log P(z_i = t_i\mid\mathbf{y}_i)$ in Equation (4.31) by $\log r_{i,t_i}$. The gradient and Hessian can still be computed
easily after noting that

$$\frac{\partial}{\partial\rho_l}w_{ij} = w_{ij}\frac{\partial}{\partial\rho_l}\log w_{ij} = w_{ij}(h_{ilj} - g_{il}),$$

$$\frac{\partial}{\partial\rho_l}\log r_{ic} = \frac{1}{r_{ic}}\frac{\partial}{\partial\rho_l}r_{ic} = \Bigl(\sum_{j=1}^{k}\frac{\beta_{cj}w_{ij}}{r_{ic}}h_{ilj}\Bigr) - g_{il}. \tag{4.32}$$


We  can  then  optimize  the  modified  J  in  Equation  (4.31)  to  carry  out  the  post-processing. 
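The modified criterion can be sketched as follows, with the component posteriors combined into class posteriors $r_{ic} = \sum_j \beta_{cj} w_{ij}$ before taking the maximum. The function name `class_separability`, the array layout, and the numerical values of `beta` in the usage note are illustrative assumptions.

```python
import numpy as np

def class_separability(w, beta):
    """Modified J: sum_i log r_{i,t_i}, with r_ic = sum_j beta_cj * w_ij.

    w    : (n, k) component posteriors P(z_i = j | y_i)
    beta : (C, k) weights linking components to classes
    """
    w = np.asarray(w, dtype=float)
    beta = np.asarray(beta, dtype=float)
    r = w @ beta.T                      # class posteriors r_ic, shape (n, C)
    t = r.argmax(axis=1)                # t_i = argmax_c r_ic
    J = float(np.sum(np.log(r.max(axis=1))))
    return J, t
```

For instance, with three components where the first belongs to class 0 and the other two share class 1 with weight 0.5 each, a point with component posteriors (0.2, 0.4, 0.4) has class posteriors (0.2, 0.4) and is labelled class 1.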


4.5.5  A  Note  on  Maximizing  the  Posterior  Probability 

The sum of the logarithm of the maximum posterior probability considered in the post-processing
in Section 4.3.4 can be regarded as the sample estimate of an unorthodox type of entropy (see [141])
for the posterior distribution. It can be regarded as the limit of Rényi's entropy $R_\alpha(P)$ when $\alpha$
tends to infinity, where

$$R_\alpha(P) = \frac{1}{1-\alpha}\log\sum_{j=1}^{k}p_j^{\alpha}. \tag{4.33}$$

When  this  entropy  is  used  for  parameter  estimation  under  the  maximum  entropy  framework,  the 
corresponding  procedure  is  closely  related  to  minimax  inference.  Other  functions  on  the  posterior 
probabilities  can  also  be  used,  such  as  the  Shannon  entropy  of  the  posterior  distribution.  Preliminary 
study  shows  that  the  use  of  different  types  of  entropy  does  not  affect  the  results  significantly. 
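The limiting behavior can be checked numerically. The sketch below assumes the standard definition of Rényi's entropy, $R_\alpha(P) = \frac{1}{1-\alpha}\log\sum_j p_j^\alpha$, whose limit as $\alpha \to \infty$ is $-\log\max_j p_j$; maximizing the sum of log maximum posteriors therefore amounts to minimizing this limiting entropy summed over the data. The function name `renyi_entropy` is an illustrative choice.

```python
import numpy as np

def renyi_entropy(p, a):
    """Renyi entropy of order a (a > 0, a != 1) of a discrete distribution p."""
    p = np.asarray(p, dtype=float)
    return float(np.log(np.sum(p ** a)) / (1.0 - a))

# A made-up posterior distribution over three components.
p = [0.7, 0.2, 0.1]
limit = -np.log(max(p))
# The entropy decreases towards -log(max p) as the order grows.
values = [renyi_entropy(p, a) for a in (2.0, 10.0, 50.0, 200.0)]
```

Very large orders would eventually underflow in `p ** a`; a log-domain evaluation would be needed for a robust implementation.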


4.6  Summary 

Given n points in d dimensions, we have presented an EM algorithm to estimate the saliencies
of individual features and the best number of components for Gaussian-mixture clustering. The
proposed algorithm can avoid running EM many times with different numbers of components and
different feature subsets, and can achieve better performance than using all the available features
for clustering. By initializing with a large number of mixture components, our EM algorithm is less
prone to the problem of poor local minima. The usefulness of the algorithm was demonstrated on
both synthetic and benchmark real data sets.

[Figure 4.6 panel titles: Iteration 1, K=30; Iteration 35, K=13; Iteration 40, K=10; Iteration 99, K=5;
Iteration 182, K=4]

Figure 4.6: An example execution of the proposed algorithm. The solid ellipses represent the
Gaussian mixture components; the dotted ellipse represents the common density. The number in
parentheses along the axis label is the feature saliency; when it reaches 1, the common component
is no longer applicable to that feature. Thus, in (d), the common component degenerates to a line;
when the feature saliency for feature 1 also becomes 1, as in Figure 4.6(f), the common density
degenerates to a point at (0, 0).

Figure 4.7: Feature saliencies for (a) the 10-D 4 Gaussian data set used in Figure 4.6(a), and (b) the
Trunk data set. The mean values plus and minus one standard deviation over ten runs are shown.
Recall that features 3 to 10 for the 4 Gaussian data set are the noisy features.


Figure 4.8: Clustering results on the image data set. Only the labels for the testing data are shown.
(a) The true class labels. (b) The clustering results by Algorithm 4.1. (c) The clustering result using
all the features. The data points are reduced to 3D by PCA. A cluster is matched to its majority class
before plotting. The error rates for the proposed algorithm and the algorithm using all the features
in this particular run are 22% and 30%, respectively.

[Figure 4.9 panels (a) to (j): for each of wine, wdbc, image, texture and zernike, one image map for
the proposed algorithm and one after post-processing]

Figure 4.9: Image maps of feature saliency for different data sets with and without the
post-processing procedure. Feature saliency of 1 (0) is shown as a pixel of gray level 255 (0). The
vertical and horizontal axes correspond to the feature number and the trial number, respectively.


Chapter  5 


Clustering  With  Constraints 


In Section 1.4, we introduced instance-level constraints as a type of side-information for clustering. In
this chapter, we shall examine the drawbacks of the existing algorithms for clustering under
constraints, and propose a new algorithm that can remedy the defects.

Recall that there are two types of instance-level constraints: a must-link/positive constraint
requires two or more objects to be put in the same cluster, whereas a must-not-link/negative constraint
requires two or more objects to be placed in different clusters. Often, the constraints are pairwise,
though one can extend them to multiple objects [231, 167]. Constraints are particularly appropriate
in a clustering scenario, because there is no clear notion of the target classes. On the other hand,
the user can suggest if two or more objects should be included in the same cluster or not. This
can be done in an interactive manner, if desired. Side-information can improve the robustness of a
clustering algorithm towards model mismatch, because it provides additional clues for the desirable
clusters other than the shape of the clusters, as suggested by the parametric model. Side-information
has also been found to alleviate the problem of local minima of the clustering objective function.
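Pairwise constraints of the two types can be represented simply as lists of index pairs. The sketch below, with the hypothetical function name `violated_constraints`, reports which constraints a given cluster labelling violates; it is meant only to illustrate the definitions, not any particular constrained clustering algorithm.

```python
def violated_constraints(labels, must_link, must_not_link):
    """Return the must-link and must-not-link pairs violated by a labelling.

    labels        : sequence of cluster labels, one per object
    must_link     : iterable of (a, b) index pairs that should share a cluster
    must_not_link : iterable of (a, b) index pairs that should be separated
    """
    # A must-link pair is violated when its two objects fall in different clusters.
    broken_ml = [(a, b) for a, b in must_link if labels[a] != labels[b]]
    # A must-not-link pair is violated when its two objects share a cluster.
    broken_mnl = [(a, b) for a, b in must_not_link if labels[a] == labels[b]]
    return broken_ml, broken_mnl
```

Because only equality of labels is tested, the result is unchanged if the cluster labels are permuted, which reflects the fact that constraints carry information about the relationship between labels rather than the labels themselves.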

Clustering  with  instance-level  constraints  is  different  from  learning  with  partially-labeled  data, 
also  known  as  transductive  learning  or  semi-supervised  learning  [136,  288,  157,  169,  289,  287,  98,  195], 
where  the  class  labels  of  some  of  the  objects  are  provided.  Constraints  only  reveal  the  relationship 
among  the  labels,  not  the  labels  themselves.  Indeed,  if  the  “absolute”  labels  can  be  specified,  the 
user  is  no  longer  facing  a  clustering  task,  and  a  supervised  method  should  be  adopted  instead. 

We contrast different learning settings according to the type of information available in Figure 5.1.
At one end of the spectrum, we have supervised learning (Figure 5.1(a)), where the labels of all the
objects are known. At the other end of the spectrum, we have unsupervised learning (Figure 5.1(d)),
where the label information is absent. In between, we can have partially labeled data (Figure 5.1(b)),
where the true class labels of some of the objects are known. The main scenario considered in this
chapter is depicted in Figure 5.1(c): there is no label information, but must-link and must-not-link
constraints (represented by solid and dashed lines, respectively) are provided. Note that the settings
exemplified in Figures 5.1(a) and 5.1(b) are classification-oriented because there is a clear definition
of different classes. On the other hand, the setups in Figures 5.1(c) and 5.1(d) are clustering-oriented,
because no precise definitions of classes are given. The clustering algorithm needs to discover the
classes.

5.0.1  Related  Work 

Different  algorithms  have  been  proposed  for  clustering  under  instance- level  constraints.  In  [262], 
the  four  primary  operators  in  COBWEB  were  modified  in  view  of  the  constraints.  The  fc-means 
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(a)  Supervised 


(b)  Partially  labeled 


(d)  Unsupervised 


Figure  5.1:  Supervised,  unsupervised,  and  intermediate.  In  this  figure,  dots  correspond  to  points 
without  any  labels.  Points  with  labels  are  denoted  by  circles,  asterisks  and  crosses.  In  (c),  the 
must-link  and  must-not-link  constraints  are  denoted  by  solid  and  dashed  lines,  respectively. 


algorithm  was  modified  in  [263]  to  avoid  violating  the  constraints  when  different  objects  are  assigned 
to  different  clusters.  However,  the  algorithm  can  fail  even  when  a  solution  exists.  Positive  constraints 
served  as  “short-cuts”  in  [148]  to  modify  the  dissimilarity  measure  for  complete-link  clustering.  There 
can  be  catastrophic  consequences  if  a  single  constraint  is  incorrect,  because  the  dissimilarity  matrix 
can  be  greatly  distorted  by  a  wrong  constraint.  Spectral  clustering  was  modified  in  [138]  to  work 
with  constraints,  which  augmented  the  affinity  matrix.  Constraints  were  incorporated  into  image 
segmentation  algorithms  by  solving  the  constrained  version  of  the  corresponding  normalized  cut 
problem,  with  smoothness  of  cluster  labels  explicitly  incorporated  in  the  formulation  [279].  Hidden 
Markov  random  field  was  used  in  [14]  for  /c-nreans  clustering  with  constraints.  Constraints  have  also 
been  used  for  metric- learning  [274];  in  fact,  the  problems  of  metric- learning  and  k- means  clustering 
with  constraints  were  considered  simultaneously  in  [21].  Because  the  problem  of  fc-means  with 
metric-learning  is  related  to  EM  clustering  with  a  common  covariance  matrix,  the  work  in  [21]  may 
be  viewed  as  related  to  EM  clustering  with  constraints.  The  work  in  [158]  extended  the  work  in  [21] 
Summary | Key ideas | Examples
Distance editing | Modify the distance/proximity matrix due to the constraints | [148, 138]
Constraints on labels | The cluster labels are inferred under the restriction that the constraints are always satisfied | [262, 263, 279]
Hidden Markov random field | Cluster labels constitute a hidden Markov random field; feature vectors are assumed to be independent of each other given cluster labels | [14, 21, 12, 158, 176, 161, 286]
Modify generation model | Generation process of data points that participate in constraints is modified | [231, 166, 167]
Constraints resolution | Clustering solution is obtained by resolving constraints only | [10]

Table 5.1: Different algorithms for clustering with constraints.


by studying the relationship between constraints and the kernel k-means algorithm. Ideas based
on hidden Markov random fields have also been used for model-based clustering with constraints
[14, 176, 161]; the difference between these three methods lies in how the inference is conducted.
In particular, the method in [14] used iterated conditional modes (ICM), the method in [176] used
Gibbs sampling, and the method in [161] used a mean-field approximation. The approach in [286]
is similar to [161], since both used a mean-field approximation. However, the authors of [286] also
considered the case when each class is modeled by more than one component. A related idea was
presented in [231], which uses a graphical model for generating the data with constraints. A fairly
different route to clustering under constraints was taken by the authors in [10] under the name
“correlation clustering”, which used only the positive and negative constraints (and no information
on the objects) for clustering. The number of clusters can be determined by the constraints.

Table  5.1  provides  a  summary  of  these  algorithms  for  clustering  under  constraints.  In  most  of 
these  approaches,  clustering  with  constraints  has  been  shown  to  improve  the  quality  of  clustering  in 
different  domains.  Example  applications  include  text  classification  [14],  image  segmentation  [161], 
and  video  retrieval  [231]. 

5.0.2  The  Hypothesis  Space 

An  important  issue  in  parametric  clustering  under  constraints,  namely  the  hypothesis  space,  has 
virtually  been  ignored  in  the  current  literature.  Here,  we  adopt  the  terminology  from  inductive 
learning  and  regard  “hypothesis  space”  as  the  space  of  all  possible  solutions  to  the  clustering  task. 
Since  partitional  clustering  can  be  viewed  as  the  construction  of  a  mapping  from  a  set  of  objects  to 
a  set  of  cluster  labels,  its  hypothesis  space  is  the  set  of  all  possible  mappings  between  the  objects 
(or  their  representations)  and  the  cluster  labels.  In  a  non-parametric  clustering  algorithm  such 
as  pairwise  clustering  [114]  and  methods  based  on  graph-cut  [234,  272],  there  is  no  restriction  on 
this  hypothesis  space.  A  particular  non-parametric  clustering  algorithm  selects  the  best  clustering 
solution in the space according to some criterion function. Consequently, if a poor criterion function
is used (perhaps due to the influence of constraints), one can obtain a counter-intuitive clustering
solution such as the one in Figure 5.3(d), where very similar objects can be assigned different cluster
labels. Note that objects in non-parametric clustering, unlike in parametric clustering, may not
have a feature vector representation. They can be represented, for example, by pairwise affinity or
by a higher-order dissimilarity measure [1].

The  hypothesis  space  in  parametric  clustering  is  typically  much  smaller,  because  the  parametric 
assumption  imposes  restrictions  on  the  cluster  boundaries.  While  these  restrictions  are  generally 


[Figure 5.2: two rows of scatter-plot panels, (a) to (c) and (d) to (f)]


Figure 5.2: An example contrasting parametric and non-parametric clustering. The particular
parametric family considered here is a mixture of Gaussians with a common covariance matrix. This
is reflected by the linear cluster boundary. The clustering solutions in (a) to (c) are in the hypothesis
space induced by this model assumption, whereas the clustering solutions in (d) to (f) are outside the
hypothesis space and thus can never be obtained, no matter which objective function is used. On
the other hand, all six solutions are within the hypothesis space of non-parametric clustering.
A non-parametric method may therefore produce the clustering solutions depicted in (d), (e), and (f)
if a poor clustering objective function is used.


perceived  as  a  drawback,  they  become  advantageous  when  they  prevent  counter-intuitive  clustering 
solutions such as the one in Figure 5.3(d) from appearing. These clustering solutions are simply
outside  the  hypothesis  space  of  parametric  clustering,  and  are  never  attainable  irrespective  of  how 
the  constraints  modify  the  clustering  objective  function. 

An example contrasting parametric and non-parametric clustering is shown in Figure 5.2. The
particular parametric family considered in this example is a mixture of Gaussians with a common
covariance matrix, resulting in linear cluster boundaries.
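The link between a common covariance matrix and linear cluster boundaries can be checked numerically. The following is a minimal sketch, not part of the original text: the function names `log_posterior_ratio` and `linear_boundary` are hypothetical, and the two-component setting is an assumption made for illustration. With a shared covariance the quadratic terms in y cancel, so the log posterior ratio of two components is an affine function of y.

```python
import numpy as np

def log_posterior_ratio(y, mu1, mu2, cov, alpha1=0.5, alpha2=0.5):
    """log[alpha1 N(y; mu1, cov)] - log[alpha2 N(y; mu2, cov)], computed directly.

    The Gaussian normalizing constants are identical for both components
    (same covariance), so they cancel and are omitted.
    """
    prec = np.linalg.inv(cov)
    q1 = (y - mu1) @ prec @ (y - mu1)
    q2 = (y - mu2) @ prec @ (y - mu2)
    return np.log(alpha1) - np.log(alpha2) - 0.5 * (q1 - q2)

def linear_boundary(mu1, mu2, cov, alpha1=0.5, alpha2=0.5):
    """Return (w, b) such that the log posterior ratio equals w @ y + b."""
    prec = np.linalg.inv(cov)
    w = prec @ (mu1 - mu2)
    b = -0.5 * (mu1 @ prec @ mu1 - mu2 @ prec @ mu2) + np.log(alpha1 / alpha2)
    return w, b

if __name__ == "__main__":
    mu1, mu2 = np.array([-2.0, 0.0]), np.array([2.0, 0.0])
    cov = np.array([[2.0, 0.5], [0.5, 1.0]])
    w, b = linear_boundary(mu1, mu2, cov)
    y = np.array([0.3, -1.2])
    # Both numbers agree: the cluster boundary {y : w @ y + b = 0} is a hyperplane.
    print(log_posterior_ratio(y, mu1, mu2, cov), w @ y + b)
```

Any partition whose boundary is not a hyperplane therefore lies outside the hypothesis space of this model, whatever objective function is used to choose the parameters.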

5.0.2.1 Inconsistent Hypothesis Space in Existing Approaches

The basic idea of most existing algorithms for parametric clustering with instance-level constraints
[263, 14, 21, 12, 158, 176, 161, 286] is to use some variant of hidden Markov random fields to model
the  cluster  labels  and  the  feature  vectors.  Given  the  cluster  label  of  the  object,  its  feature  vector  is 
assumed  to  be  independent  of  the  feature  vectors  and  the  cluster  labels  of  all  the  other  objects.  The 
cluster  labels,  which  are  hidden  (unknown),  form  a  Markov  random  field,  with  the  potential  function 
in  this  random  field  related  to  the  satisfiability  of  the  constraints  based  on  the  cluster  labels. 

There  is  an  unfortunate  consequence  of  adopting  the  hidden  Markov  random  field,  however.  For 
objects participating in the constraints, their cluster labels are determined by the cluster parameters,
associated feature vectors and the constraints. On the other hand, for data points without
constraints, the cluster labels are determined by only the cluster parameters and associated feature
vectors. We can thus see that there is an inconsistency in how the objects obtain their cluster labels.
In other words, two identical objects, one with a constraint and one without, can be assigned different
cluster labels! This is the underlying reason for the problem illustrated in Figure 5.3(d), where
two  objects  with  almost  identical  feature  vectors  are  assigned  different  labels  due  to  the  constraints. 

From  a  generative  viewpoint,  the  above  inconsistency  is  caused  by  the  difference  in  how  data 
points  with  and  without  constraints  are  generated.  For  the  data  points  without  constraint,  each  of 
them is generated in an identical and independent manner according to the current cluster parameter
value. On the other hand, all the data points with constraints are generated simultaneously by first
choosing the cluster labels according to the hidden Markov random field, followed by the generation
of the feature vectors based on the cluster labels. It is a dubious modeling assumption that “posterior”
knowledge such as the set of instance-level constraints, which are solicited from the user after
observing  the  data,  should  control  how  the  data  are  generated  in  the  first  place. 

Note  that  this  inconsistency  does  not  exist  if  all  the  objects  to  be  clustered  are  involved  in  some 
constraints  determined  by  the  properties  of  the  objects.  This  is  commonly  encountered  in  image 
segmentation  [128],  where  pixel  attributes  (e.g.  intensities  or  filter  responses)  and  spatial  coherency 
based  on  the  locations  of  the  pixels  are  considered  simultaneously  to  decide  the  segment  label.  In 
this  case,  the  cluster  labels  of  all  the  objects  are  determined  by  both  the  constraints  and  the  feature 
vectors. 


5.0.2.2 Proposed Solution

We  propose  to  eliminate  the  problem  of  inconsistent  hypothesis  space  by  enforcing  a  uniform  way  to 
determine  the  cluster  label  of  an  object.  We  use  the  same  hypothesis  space  of  standard  parametric 
clustering  for  parametric  clustering  under  constraints.  The  constraints  are  only  used  to  bias  the 
search  of  a  clustering  solution  within  this  hypothesis  space.  Since  each  clustering  solution  in  this 
hypothesis  space  can  be  represented  by  the  cluster  parameters,  the  constraints  play  no  role  in 
determining  the  cluster  labels,  given  the  cluster  parameters.  The  quality  of  the  cluster  parameters 
with  respect  to  the  constraints  is  computed  by  examining  how  well  the  cluster  labels  (determined  by 
the  cluster  parameters)  satisfy  the  constraints.  However,  cluster  parameters  that  fit  the  constraints 
well  may  not  fit  the  data  well.  We  need  a  tradeoff  between  these  two  goals.  This  can  be  done  by 
maximizing  a  weighted  sum  of  the  data  log-likelihood  and  a  constraint  fit  term.  The  details  will  be 
presented  in  Section  5.3. 


5.1  Preliminaries 

Given a set of $n$ objects $\mathcal{Y} = \{y_1, \ldots, y_n\}$, (probabilistic) parametric partitional clustering discovers
the cluster structure of the data under the assumption that data in a cluster are generated according
to a certain probabilistic model $p(y \mid \theta_j)$, with $\theta_j$ representing the parameter vector for the $j$-th
cluster. For simplicity, the number of clusters, $k$, is assumed to be specified by the user, though
a model selection strategy (such as minimum description length [81] or stability [162]) can be applied
to determine $k$, if desired. The distribution of the data can be written as a finite mixture distribution,
i.e.,

$$p(y) = \sum_{z} p(y \mid z)\, p(z) = \sum_{j=1}^{k} \alpha_j\, p(y \mid \theta_j). \tag{5.1}$$
Here, $z$ denotes the cluster label, $\alpha_j$ denotes $p(z = j)$ (the prior probability of cluster $j$), and $p(y \mid \theta_j)$
corresponds to $p(y \mid z = j)$. Clustering is performed by estimating the model parameter $\theta$, defined
by $\theta = (\alpha_1, \ldots, \alpha_k, \theta_1, \ldots, \theta_k)$. By applying the maximum likelihood principle, $\theta$ can be estimated
as $\hat{\theta} = \arg\max_{\theta} \mathcal{L}(\theta; \mathcal{Y})$, where the log-likelihood $\mathcal{L}(\theta; \mathcal{Y})$ is defined as

$$\mathcal{L}(\theta; \mathcal{Y}) = \sum_{i=1}^{n} \log p(y_i) = \sum_{i=1}^{n} \log \sum_{j=1}^{k} \alpha_j\, p(y_i \mid \theta_j). \tag{5.2}$$


This maximization is often done by the EM algorithm [58] by regarding $z_i$ (the cluster label of $y_i$)
as the missing data. The posterior probability $p(z = j \mid y)$ represents how likely it is that $y$ belongs
to the $j$-th cluster. If a hard cluster assignment is desired, the MAP (maximum a posteriori) rule
can be applied based on the model in Equation (5.1), i.e., the object $y$ is assigned to the $j$-th cluster
($z = j$) if

$$j = \arg\max_{l} \frac{\alpha_l\, p(y \mid \theta_l)}{\sum_{l'} \alpha_{l'}\, p(y \mid \theta_{l'})}. \tag{5.3}$$
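The mixture log-likelihood and the MAP rule can be sketched in a few lines. This is an illustrative sketch only, assuming Gaussian components; the function names `gaussian_log_pdf` and `mixture_log_likelihood_and_map` are hypothetical and do not come from the thesis. It evaluates the log-likelihood of Equation (5.2), the posterior probabilities, and the hard labels of Equation (5.3), using the log-sum-exp trick for numerical stability.

```python
import numpy as np

def gaussian_log_pdf(Y, mean, cov):
    """Row-wise log N(y_i; mean, cov) for an (n, d) array Y."""
    d = Y.shape[1]
    diff = Y - mean
    prec = np.linalg.inv(cov)
    maha = np.einsum("ni,ij,nj->n", diff, prec, diff)
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + maha)

def mixture_log_likelihood_and_map(Y, alphas, means, covs):
    """Return (log-likelihood, posterior matrix, MAP labels) for a Gaussian mixture."""
    # log_joint[i, j] = log alpha_j + log p(y_i | theta_j)
    log_joint = np.stack(
        [np.log(a) + gaussian_log_pdf(Y, m, c) for a, m, c in zip(alphas, means, covs)],
        axis=1,
    )
    # log p(y_i) via log-sum-exp over the components
    mx = log_joint.max(axis=1, keepdims=True)
    log_py = mx[:, 0] + np.log(np.exp(log_joint - mx).sum(axis=1))
    posterior = np.exp(log_joint - log_py[:, None])  # p(z = j | y_i)
    # The denominator of the MAP rule does not depend on j, so the argmax
    # of the joint equals the argmax of the posterior.
    labels = log_joint.argmax(axis=1)
    return log_py.sum(), posterior, labels

if __name__ == "__main__":
    Y = np.array([[0.0, 0.0], [5.0, 5.0], [0.5, -0.2]])
    means = [np.zeros(2), np.full(2, 5.0)]
    covs = [np.eye(2), np.eye(2)]
    loglik, post, labels = mixture_log_likelihood_and_map(Y, [0.5, 0.5], means, covs)
    print(loglik, labels)
```

Note that the labels depend only on the feature vectors and the parameters, which is the property the proposed approach later preserves when constraints are added.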


5.1.1  Exponential  Family 

While there are many possibilities for the form of the probability distribution $p(y \mid \theta_j)$, it is very
common to assume that $p(y \mid \theta_j)$ belongs to the exponential family. The distribution $p(y \mid \theta_j)$ is in
the exponential family if it satisfies the following two criteria: the support of $p(y \mid \theta_j)$ (the set of $y$
with non-zero probability) is independent of the value of $\theta_j$, and $p(y \mid \theta_j)$ can be written in the
form

$$p(y \mid \theta_j) = \exp\left(\phi(y)^T \psi(\theta_j) - A(\theta_j)\right). \tag{5.4}$$

Here, $\phi(y)$ transforms the data $y$ into the “sufficient statistics”, meaning that $\phi(y)$ encompasses
all the relevant information of $y$ in the computation of $p(y \mid \theta_j)$. The function $A(\theta_j)$, also
known as the log-partition function, normalizes the density so that it integrates to one over all $y$.
The function $\psi(\theta_j)$ transforms the parameter and enables us to adopt different parameterizations
of the same density. When $\psi(\cdot)$ is the identity mapping, the density is said to be in natural
parameterization, and $\theta_j$ is known as the natural parameter of the distribution. The function $A(\theta_j)$ then
becomes the cumulant generating function, and the derivatives of $A(\theta_j)$ generate the cumulants of
the sufficient statistics. For example, the gradient and Hessian of $A(\theta_j)$ (with respect to $\theta_j$) lead
to the expected value and the covariance matrix of the sufficient statistics, respectively. Note that
$A(\theta_j)$ is a convex function, and the domain of $\theta_j$ where the density is well-defined under natural
parameterization is also convex.

As an example, consider a multivariate Gaussian density with mean vector $\mu$ and covariance
matrix $\Sigma$. Its pdf is given by

$$p(y) = \exp\left(-\frac{d}{2} \log(2\pi) + \frac{1}{2} \log\det \Sigma^{-1} - \frac{1}{2} (y - \mu)^T \Sigma^{-1} (y - \mu)\right), \tag{5.5}$$

where $d$ is the dimension of the feature vector $y$. If we define $\Upsilon = \Sigma^{-1}$ and $\nu = \Sigma^{-1} \mu$, the above
can be rewritten as

$$p(y) = \exp\left(\operatorname{trace}\left(-\frac{1}{2} y y^T \Upsilon\right) + y^T \nu - \frac{d}{2} \log(2\pi) + \frac{1}{2} \log\det \Upsilon - \frac{1}{2} \nu^T \Upsilon^{-1} \nu\right). \tag{5.6}$$

From this, we can see that the sufficient statistics consist of $-\frac{1}{2} y y^T$ and $y$. The set of natural
parameters is given by $(\Upsilon, \nu)$. The parameter $\nu$ can take any value in $\mathbb{R}^d$, whereas $\Upsilon$ can only
assume values in the positive-definite cone of $d$ by $d$ symmetric matrices. Both of these sets are
convex, as expected. The log-partition function is given by

$$A(\theta) = \frac{d}{2} \log(2\pi) - \frac{1}{2} \log\det \Upsilon + \frac{1}{2} \nu^T \Upsilon^{-1} \nu, \tag{5.7}$$

which can be shown to be convex within the domain of $\Upsilon$ and $\nu$, where the density is well-defined.
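The equivalence between the usual form of the Gaussian density and its natural parameterization can be verified numerically. The sketch below is an illustration under the definitions given above (precision matrix as one natural parameter, precision times mean as the other); the function names and the test values are assumptions, not taken from the thesis.

```python
import numpy as np

def gaussian_log_pdf_standard(y, mu, sigma):
    """Log of the Gaussian density in the mean/covariance form."""
    d = y.shape[0]
    prec = np.linalg.inv(sigma)
    _, logdet_prec = np.linalg.slogdet(prec)
    diff = y - mu
    return -0.5 * d * np.log(2 * np.pi) + 0.5 * logdet_prec - 0.5 * diff @ prec @ diff

def gaussian_log_pdf_natural(y, upsilon, nu):
    """Log density written as <sufficient statistics, natural parameters> - A(theta)."""
    d = y.shape[0]
    _, logdet = np.linalg.slogdet(upsilon)
    # Log-partition function A(theta) of the Gaussian
    log_partition = (
        0.5 * d * np.log(2 * np.pi)
        - 0.5 * logdet
        + 0.5 * nu @ np.linalg.solve(upsilon, nu)
    )
    # Inner product of (-0.5 * y y^T, y) with (upsilon, nu)
    inner = np.trace(-0.5 * np.outer(y, y) @ upsilon) + y @ nu
    return inner - log_partition

if __name__ == "__main__":
    mu = np.array([1.0, -2.0])
    sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
    upsilon = np.linalg.inv(sigma)   # first natural parameter
    nu = upsilon @ mu                # second natural parameter
    y = np.array([0.5, 0.5])
    print(gaussian_log_pdf_standard(y, mu, sigma), gaussian_log_pdf_natural(y, upsilon, nu))
```

The two printed values coincide up to floating-point error, which confirms term by term that the rewritten density is the same function of y.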

It is interesting to note that the exponential family is closely related to Bregman divergence [27].
For any Bregman divergence $D_\rho(\cdot, \cdot)$ derived from a strictly convex function $\rho(\cdot)$, one can construct
a function $f_\rho$ such that

$$p(y) = \exp\left(-D_\rho(y, \mu)\right) f_\rho(y)$$

is a member of the exponential family. Here, $\mu$ is the moment parameter, meaning that it is the
expected value of the sufficient statistics¹. The cumulant generating function of the density is given
by the Legendre dual of $\rho(\cdot)$. One important consequence of this relationship is that soft-clustering
(clustering where an object can be partially assigned to a cluster) based on any Bregman divergence
can be done by fitting a mixture of the corresponding distribution in the exponential family, as
argued in [9]. Since Bregman divergence includes many useful distance measures² as special cases
(such as squared Euclidean distance and Kullback-Leibler divergence; see [9] for more), a mixture density,
with each component density in the exponential family, covers many interesting clustering scenarios.
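To make the special cases concrete, the sketch below evaluates a Bregman divergence from its standard definition, D(x, m) = rho(x) - rho(m) - grad rho(m)^T (x - m), for two choices of the convex function. The function name `bregman_divergence` and the chosen generators are illustrative assumptions; the definition itself is the standard one and is not spelled out in this section.

```python
import numpy as np

def bregman_divergence(x, m, rho, grad_rho):
    """D_rho(x, m) = rho(x) - rho(m) - <grad rho(m), x - m>."""
    return rho(x) - rho(m) - grad_rho(m) @ (x - m)

# Generator 1: squared Euclidean norm, which yields the squared Euclidean distance.
sq_norm = lambda v: v @ v
sq_norm_grad = lambda v: 2.0 * v

# Generator 2: negative entropy, which yields the Kullback-Leibler divergence
# when both arguments are probability vectors with positive entries.
neg_entropy = lambda p: np.sum(p * np.log(p))
neg_entropy_grad = lambda p: np.log(p) + 1.0

if __name__ == "__main__":
    x, m = np.array([1.0, 2.0]), np.array([0.0, -1.0])
    print(bregman_divergence(x, m, sq_norm, sq_norm_grad), np.sum((x - m) ** 2))

    p, q = np.array([0.2, 0.8]), np.array([0.5, 0.5])
    print(bregman_divergence(p, q, neg_entropy, neg_entropy_grad), np.sum(p * np.log(p / q)))
```

Swapping the arguments in the second example generally changes the value, which illustrates why a Bregman divergence is not a distance function in the strict sense.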

5.1.2  Instance-level  Constraints 

We assume that the user has provided side-information in the form of a set of instance-level constraints
(denoted by $\mathcal{C}$). The set of must-link constraints, denoted by $\mathcal{C}^+$, is represented by the
indicator variables $\tilde{a}_{hi}$, such that $\tilde{a}_{hi} = 1$ iff $y_i$ participates in the $h$-th must-link constraint. For
example, if the user wants to state that the pair $(y_2, y_8)$ participates in the fifth must-link constraint,
the user sets $\tilde{a}_{5,2} = 1$, $\tilde{a}_{5,8} = 1$, and $\tilde{a}_{5,i} = 0$ for all other $i$. This formulation, while
less explicit than the formulation in [161], which specifies the pairs of points participating in the
constraints directly, allows easy generalization to group constraints [166]: we simply set $\tilde{a}_{hi}$ to one
for all $y_i$ that are involved in the $h$-th group constraint. We also define $a_{hi} = \tilde{a}_{hi} / \sum_i \tilde{a}_{hi}$, where
$a_{hi}$ can be perceived as the “normalized” indicator matrix, in the sense that $\sum_i a_{hi} = 1$. The set
of must-not-link constraints, denoted by $\mathcal{C}^-$, is represented similarly by the variables $\tilde{b}_{hi}$ and $b_{hi}$.
Specifically, $\tilde{b}_{hi} = 1$ if $y_i$ participates in the $h$-th must-not-link constraint, and $b_{hi} = \tilde{b}_{hi} / \sum_i \tilde{b}_{hi}$.
Note that $\{a_{hi}\}$ and $\{b_{hi}\}$ are highly sparse, because each constraint provided by the user involves
only a small number of points (two if all the constraints are pairwise).
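The indicator representation of the constraints can be sketched as follows. This is an illustration only: the function name `normalized_indicator` is hypothetical, indices are 0-based (so the pair of the second and eighth objects in the fifth constraint becomes row 4, columns 1 and 7), and a dense array is used for readability even though a sparse matrix would be the natural choice in practice.

```python
import numpy as np

def normalized_indicator(groups, n):
    """Build the raw and row-normalized indicator matrices for a list of constraints.

    groups : list of tuples; groups[h] holds the 0-based indices of the objects
             that participate in the h-th constraint (two for a pairwise
             constraint, more for a group constraint).
    n      : total number of objects.
    """
    raw = np.zeros((len(groups), n))
    for h, members in enumerate(groups):
        raw[h, list(members)] = 1.0
    # Each row is divided by its number of participants, so every row sums to one.
    normalized = raw / raw.sum(axis=1, keepdims=True)
    return raw, normalized

if __name__ == "__main__":
    # Five must-link constraints on ten objects; the fifth links objects 2 and 8
    # in 1-based numbering, i.e. indices 1 and 7 here.
    must_link = [(0, 3), (2, 4), (5, 6), (8, 9), (1, 7)]
    raw, norm = normalized_indicator(must_link, n=10)
    print(raw[4])
    print(norm[4])
```

A group constraint is handled by the same code: passing a tuple with three or more indices simply sets more entries in that row before normalization.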

5.2  An  Illustrative  Example 

In this section, we describe a simple example to illustrate an important shortcoming of methods for
parametric clustering under constraints that are based on hidden Markov random fields - the approach common
in the literature [263, 14, 21, 12, 158, 176, 161, 286]. In Figure 5.3, there are altogether 400 data
points generated by four different Gaussian distributions. The task is to split these data into two
clusters.


¹ The strict convexity of $A(\cdot)$ implies that there is a one-to-one correspondence between the moment parameter and
the natural parameter. While the existence of such a mapping is easy to show, constructing such a mapping can be
difficult in general.

² Strictly speaking, a Bregman divergence can be asymmetric and hence is not really a distance function.
Suppose the user, perhaps due to domain knowledge, prefers a “left” and a “right” cluster (as
shown in Figure 5.3(c)) to the more natural solution of a “top” and a “bottom” cluster (as shown in
Figure 5.3(b)). This preference can be expressed via the introduction of two must-link constraints,
represented by the solid lines in Figure 5.3(a). When we apply an algorithm based on hidden
Markov random field to discover the two clusters in this example, we can get the solution shown in
Figure 5.3(d). While cluster labels of the points involved in the constraints are modified by the
constraints, there is virtually no difference in the resulting cluster structure when compared with
the natural solution in Figure 5.3(b). This is because the change in the cluster labels of the small
number of points in constraints does not significantly affect the cluster parameters. Not only are
the clusters not what the user seeks, but the clustering solution is also counter-intuitive: the cluster
labels of points involved in the constraints are different from those of their neighbors (see the big cross and
plus in Figure 5.3(d); the symbols are enlarged for clarity).

Similar “non-smooth” clustering solutions have been observed in [279] in the context
of normalized cut clustering with constraints. A variation of the same problem has been used as
a motivation for “space-level” instead of “instance-level” constraints in [148]. One way to
understand the cause of this problem is that the use of a hidden Markov random field effectively puts
an  upper  bound  on  the  maximum  influence  of  a  constraint,  irrespective  of  how  large  the  penalty 
for  constraint  violation  is.  So,  the  adjustment  of  the  tradeoff  parameters  cannot  circumvent  this 
problem.  Since  this  problem  is  not  caused  by  the  violation  of  any  constraints,  the  inclusion  of 
negative  constraints  cannot  help. 


5.2.1  An  Explanation  of  The  Anomaly 

In order to have a better understanding of why an “unnatural” solution depicted in Figure 5.3(d) is
obtained, let us examine the hidden Markov random field approach for clustering under constraints
in more detail. In this approach, the distribution of the cluster labels (represented by $z_i$) and the
feature vectors (represented by $y_i$) can be written as

$$p(y_1, \ldots, y_n \mid z_1, \ldots, z_n, \theta) = \prod_{i=1}^{n} p(y_i \mid z_i),$$

$$p(z_1, \ldots, z_n) \propto \exp\left(-H(z_1, \ldots, z_n, \mathcal{C}^+, \mathcal{C}^-)\right).$$

One typical choice of the potential function $H(z_1, \ldots, z_n, \mathcal{C}^+, \mathcal{C}^-)$ of the cluster labels is to count
the number of constraint violations:

$$H(z_1, \ldots, z_n, \mathcal{C}^+, \mathcal{C}^-) = \lambda^+ \sum_{(i,j) \in \mathcal{C}^+} I(z_i \neq z_j) + \lambda^- \sum_{(i,j) \in \mathcal{C}^-} I(z_i = z_j), \tag{5.8}$$

where $\lambda^+$ and $\lambda^-$ are the penalty parameters for the violation of the must-link and must-not-link
constraints, respectively. This potential function can be derived [161] by the maximum entropy
principle, with constraints (as in constrained optimization) on the number of violations of the two
types of instance-level constraints. The assignment of points to different clusters is determined by
the posterior probability $p(z_1, \ldots, z_n \mid y_1, \ldots, y_n, \theta)$. Clustering is performed by searching for the
[Figure 5.3 panels: (a) Data set, with the constraints; (b) Natural partition in 2 clusters; (c) Desired 2-cluster solution; (d) Solution by HMRF]


Figure  5.3:  A  simple  example  of  clustering  under  constraints  that  illustrates  the  limitation  of  hidden 
Markov  random  field  (HMRF)  based  approaches. 


parameters that maximize the likelihood $p(y_1, \ldots, y_n \mid \theta)$. Because

$$p(y_1, \ldots, y_n \mid \theta) = \sum_{z_1, \ldots, z_n} p(y_1, \ldots, y_n \mid z_1, \ldots, z_n, \theta)\, p(z_1, \ldots, z_n \mid \theta) \approx \max_{z_1, \ldots, z_n} p(y_1, \ldots, y_n \mid z_1, \ldots, z_n, \theta)\, p(z_1, \ldots, z_n \mid \theta), \tag{5.9}$$

the result of maximizing $p(y_1, \ldots, y_n \mid \theta)$ is often similar to the result of maximizing the “hard
assignment likelihood”, defined by $\max_{z_1, \ldots, z_n} p(y_1, \ldots, y_n \mid z_1, \ldots, z_n, \theta)\, p(z_1, \ldots, z_n \mid \theta)$. This
illustrates the relationship between “hard” approaches to clustering under constraints (such as in [263])
and the “soft” approaches (such as in [161] and [14]).

For ease of illustration, assume that $p(y \mid z = j)$ is a Gaussian with mean vector $\mu_j$ and identity
covariance matrix. The maximization of $p(y_1, \ldots, y_n \mid z_1, \ldots, z_n, \theta)\, p(z_1, \ldots, z_n \mid \theta)$ for the clustering
under constraints example in Figure 5.3 is equivalent to the minimization of

$$\sum_{i=1}^{n} \sum_{j=1}^{2} I(z_i = j)\, \|y_i - \mu_j\|^2 + \lambda^+ \sum_{(i,j) \in \mathcal{C}^+} I(z_i \neq z_j),$$
where the potential function of the Markov random field is as defined in Equation (5.8), and $\mathcal{C}^+$
contains the two must-link constraints. Note that the first term, the sum of squared Euclidean
distances between data points and the corresponding cluster centers, is the cost function for standard
k-means clustering.
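The cost function just described can be written down directly. The following is a minimal sketch with a hypothetical function name, `hmrf_cost`, covering only the must-link case used in this example: a k-means sum of squared distances plus a penalty for every must-link pair whose two points carry different labels.

```python
import numpy as np

def hmrf_cost(Y, labels, centers, must_link, lam_plus):
    """Sum of squared distances to the assigned centers plus must-link penalties.

    Y         : (n, d) array of feature vectors
    labels    : (n,) integer array of cluster labels
    centers   : (k, d) array of cluster centers
    must_link : list of (i, j) index pairs that should share a label
    lam_plus  : penalty paid for each violated must-link constraint
    """
    squared_error = ((Y - centers[labels]) ** 2).sum()
    violations = sum(int(labels[i] != labels[j]) for i, j in must_link)
    return squared_error + lam_plus * violations

if __name__ == "__main__":
    Y = np.array([[0.0, 0.0], [0.0, 2.0], [4.0, 0.0]])
    centers = np.array([[0.0, 1.0], [4.0, 0.0]])
    labels = np.array([0, 0, 1])
    # Points 0 and 2 are linked but sit in different clusters: one violation.
    print(hmrf_cost(Y, labels, centers, must_link=[(0, 2)], lam_plus=5.0))
```

Once a labeling satisfies every constraint, the penalty term vanishes and the cost no longer depends on the penalty weight, a property that matters for the argument that follows.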

We are going to compare two cluster configurations. The configuration “LR”, which consists of
a “left” and a “right” cluster, can be represented by $\mu_1^{LR} = (-2, 0)$ and $\mu_2^{LR} = (2, 0)$, and this
corresponds to the partition sought by the user in Figure 5.3(c). The configuration “TB”, which
consists of a “top” and a “bottom” cluster, can be represented by $\mu_1^{TB} = (0, -8)$ and $\mu_2^{TB} = (0, 8)$,
and this corresponds to the “natural” solution shown in Figure 5.3(b). When $\lambda^+$ is very small, the
natural solution “TB” is preferable to “LR”, because the points, on average, are closer to the cluster
centers in “TB”, and the penalty for constraint violation is negligible. As $\lambda^+$ increases, the cost
for selecting “TB” increases. When $\lambda^+ + \|y_i - \mu_2^{TB}\|^2 > \|y_i - \mu_1^{TB}\|^2$ (where $y_i$ is the point under
constraint in the upper left point cloud), that is, when the penalty saved exceeds the extra squared
distance to the bottom center, switching the cluster label of $y_i$ from “x” to “+” leads to
a lower cost for the “TB” configuration. This switching of cluster label affects the cluster centers
in the “TB” configuration. However, its influence is minimal because there is only one such point,
and the sum of squared errors in the objective function is dominated by the remaining points
that are not involved in constraints. As a result, the sum of squares term is minimized when the
cluster centers are effectively unmodified from the “TB” configuration. This leads to the counter-intuitive
clustering solution in Figure 5.3(d), where the constraints are satisfied, but the cluster
labels are “discontinuous” in the sense that an object in the middle of a dense
point cloud can assume a cluster label different from those of its neighbors. A related argument
has been used to motivate “space-level” constraints in preference to “instance-level” constraints in
[148]: the influence of instance-level constraints may fail to propagate to the surrounding points.
This problem may also be attributed to the problem of the inconsistent hypothesis space discussed
in Section 5.0.2.1, because the cluster labels of points under constraints are determined in a way
that is different from the points without constraints. When $\lambda^+$ increases further, the cost for this
counter-intuitive configuration remains the same, because no constraints are violated. Let $C$ denote
the cost of this counter-intuitive configuration.

We are now in a position to understand why it is not possible to attain the desirable configuration
“LR”. By pushing the vertical and horizontal point clouds away from each other, we can arbitrarily
increase the cost of the “LR” configuration, while keeping the cost of the “TB” configuration the
same. While the cost for the counter-intuitive configuration also increases when the two point clouds
are pushed apart, such an increase is very slow because only the distance of one point (as in the term
$\|y_i - \mu_1^{TB}\|^2$) is affected. Consequently, the cost of the “LR” configuration can be made larger than $C$,
which is indeed the case for the example in Figure 5.3. Therefore, assuming that the clustering under
constraints algorithm finds the clustering solution that minimizes the cost function, the desired “LR”
configuration can never be recovered.
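The argument can be reproduced numerically on synthetic data. The sketch below is an illustration under several assumptions that are not taken from the thesis: the four point clouds are placed at (±2, ±8) with unit variance, one must-link constraint joins a point in the upper-left cloud to one in the lower-left cloud and another joins the upper-right and lower-right clouds, and the cluster centers are held fixed at the values quoted above instead of being re-estimated. All variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Four clouds of 100 points: upper-left, upper-right, lower-left, lower-right.
cloud_means = np.array([[-2.0, 8.0], [2.0, 8.0], [-2.0, -8.0], [2.0, -8.0]])
Y = np.vstack([m + rng.standard_normal((100, 2)) for m in cloud_means])

# Assumed must-link constraints: (upper-left, lower-left) and (upper-right, lower-right).
must_link = [(0, 200), (100, 300)]

def cost(labels, centers, lam_plus):
    """Sum of squared distances plus lam_plus per violated must-link constraint."""
    squared_error = ((Y - centers[labels]) ** 2).sum()
    violated = sum(int(labels[i] != labels[j]) for i, j in must_link)
    return squared_error + lam_plus * violated

centers_lr = np.array([[-2.0, 0.0], [2.0, 0.0]])   # "LR": left and right clusters
centers_tb = np.array([[0.0, -8.0], [0.0, 8.0]])   # "TB": bottom and top clusters

labels_lr = np.repeat([0, 1, 0, 1], 100)           # left clouds -> 0, right clouds -> 1
labels_tb = np.repeat([1, 1, 0, 0], 100)           # top clouds -> 1, bottom clouds -> 0

# Counter-intuitive configuration: keep the "TB" centers, but move the two
# constrained top points to the bottom cluster so that no constraint is violated.
labels_ci = labels_tb.copy()
labels_ci[[0, 100]] = 0

if __name__ == "__main__":
    for lam in (1.0, 1e3, 1e6):
        print(
            f"lambda+={lam:g}",
            f"LR={cost(labels_lr, centers_lr, lam):.1f}",
            f"TB={cost(labels_tb, centers_tb, lam):.1f}",
            f"counter-intuitive={cost(labels_ci, centers_tb, lam):.1f}",
        )
```

Under these assumptions the cost of the counter-intuitive configuration does not change with the penalty weight and stays well below the cost of “LR”, so a minimizer of this cost never selects “LR”, however large the penalty is made.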

Note that specifying additional constraints (either must-link or must-not-link) on points already
participating in the constraints cannot solve the problem, because none of the constraints are violated
in the counter-intuitive configuration. This problem remained unnoticed in previous studies, because
it arises only when the number of constraints is small. When a large number of data points are
involved in constraints, the sum of squared errors is no longer dominated by data points not involved
in constraints. The enforcement of constraints changes the cluster labels, which in turn modifies
the cluster centers significantly during the minimization of the sum of squared errors. The counter-intuitive
configuration is no longer optimal, and the “LR” configuration will be generated because of its
smaller cost. Note that this problem is independent of the probabilistic model chosen to represent
each cluster: the same problem can arise if there is no restriction on the covariance matrix, for
example.

There  are  several  ways  to  circumvent  this  problem.  One  possibility  is  to  increase  the  number  of 
constraints  so  that  the  constraints  involve  a  large  number  of  data  points.  However,  clustering  under 
constraints  is  most  useful  when  there  are  few  constraints,  because  the  creation  of  constraints  often 
requires  a  significant  effort  on  the  part  of  the  user.  Instead  of  soliciting  additional  constraints  from 
the user, the system should give the user an option to increase arbitrarily the influence of the
existing constraints - something the hidden Markov random field approach fails to do. One may also
try to initialize the cluster parameters intelligently [13] so that a desired local minimum (the “LR”
configuration in Figure 5.3(c)) is obtained, instead of the global minimum (the counter-intuitive
configuration in Figure 5.3(d) or the “TB” configuration in Figure 5.3(b), depending on the value
of $\lambda^+$). However, this approach is heuristic. Indeed, the discussion above reveals a problem in the
objective function itself, and we should specify a more appropriate objective function to reflect what
the user really desires. The solution in [161] is to introduce a parameter (in addition to $\lambda^+$ and $\lambda^-$)
that  can  increase  the  influence  of  data  points  in  constraints.  However,  this  approach  introduces  an 
additional  parameter,  and  it  is  also  heuristic.  An  alternative  potential  function  for  use  in  the  hidden 
Markov  random  field  has  been  proposed  in  [14]  to  try  to  circumvent  the  problem. 

Because  the  main  problem  lies  in  the  objective  function  itself,  we  propose  a  principled  solution 
to  this  problem  by  specifying  an  alternative  objective  function  for  clustering  under  constraints. 


5.3  Proposed  Approach 

Our approach begins by requiring the hypothesis space (see Section 5.0.2) used by parametric clustering
under constraints to be the same as the hypothesis space used by parametric clustering without
constraints.  This  means  that  the  cluster  label  of  an  object  should  be  determined  by  its  feature  vector 
and  the  cluster  parameters  according  to  the  MAP  rule  in  Equation  (5.3)  based  on  the  standard  finite 
mixture  model  in  Equation  (5.1).  The  constraints  should  play  no  role  in  deciding  the  cluster  labels. 
This  contrasts  with  the  hidden  Markov  random  field  approaches  (see  Section  5.2),  where  both  the 
cluster  labels  and  the  cluster  parameters  can  freely  vary  to  minimize  the  cost  function. 

The  desirable  cluster  parameters  should  (i)  result  in  cluster  labels  that  satisfy  the  constraints, 
and  (ii)  explain  the  data  well.  These  two  goals,  however,  may  conflict  with  each  other,  and  a 
compromise is made by the use of tradeoff parameters. Formally, we seek the parameter vector $\theta$
that maximizes an objective function $J(\theta; \mathcal{Y}, \mathcal{C})$, defined by

$$J(\theta; \mathcal{Y}, \mathcal{C}) = \mathcal{L}(\theta; \mathcal{Y}) + \mathcal{F}(\theta; \mathcal{C}), \tag{5.10}$$

where $\mathcal{F}(\theta; \mathcal{C})$ denotes how well the clusters specified in $\theta$ satisfy the constraints in $\mathcal{C}$. It consists of
two terms: $f^+(\theta; \mathcal{C}_h^+)$ and $f^-(\theta; \mathcal{C}_h^-)$. The loss functions $f^+(\theta; \mathcal{C}_h^+)$ and $f^-(\theta; \mathcal{C}_h^-)$ correspond to
the violation of the $h$-th must-link constraint (denoted by $\mathcal{C}_h^+$) and the $h$-th must-not-link constraint
(denoted by $\mathcal{C}_h^-$), respectively. There are altogether $m^+$ must-link constraints and $m^-$ must-not-link
constraints, i.e., $|\mathcal{C}^+| = m^+$ and $|\mathcal{C}^-| = m^-$. The log-likelihood term $\mathcal{L}(\theta; \mathcal{Y})$, which corresponds to
the fit of the data $\mathcal{Y}$ by the model parameter $\theta$, is the same as the log-likelihood of the finite mixture
model used in standard parametric clustering (Equation (5.2)). The parameters $\lambda^+$ and $\lambda^-$ give us
flexibility to assign different weights on the constraints. In practice, they are set to a common value
$\lambda$. The value of $\lambda$ can either be specified by the user, or it can be estimated by a cross-validation
type of procedure. For brevity, sometimes we drop the dependence of $J$ on $\theta$, $\mathcal{Y}$ and $\mathcal{C}$ and write $J$
as the objective function.

How  can  this  approach  be  superior  to  the  HMRF  approaches?  A  counter-intuitive  clustering 
solution  such  as  the  one  depicted  in  Figure  5.3(d)  is  no  longer  attainable.  The  cluster  boundaries 
are determined solely by the cluster parameters. So, in the example in Figure 5.3(d), the top-left
“big plus” point will assume the cluster label of “x”, whereas the bottom-right “big cross” point
will assume the cluster label of “+”, based on the value of the cluster parameters as shown in the
figure.  The  second  benefit  is  that  the  effect  of  the  instance-level  constraints  is  propagated  to  the 
surrounding  points  automatically,  thereby  achieving  the  effect  of  the  desirable  space-level  constraints. 
This  is  because  parametric  cluster  boundaries  divide  the  data  space  into  different  contiguous  regions. 
Another  advantage  of  the  proposed  approach  is  that  it  can  obtain  clustering  solutions  unattainable 
by  HMRF  approaches.  For  example,  the  “TB”  configuration  in  Figure  5.3(b)  can  be  made  to  have 
an arbitrarily high cost by increasing the value of the constraint penalty parameter $\lambda^+$. Since the
cost of the “LR” configuration is not affected by $\lambda^+$, the “LR” configuration will have a smaller
cost than the “TB” configuration with a large $\lambda^+$. When the cost function is minimized, the “LR”
configuration  sought  by  the  user  will  be  returned. 


5.3.1  Loss  Function  for  Constraint  Violation 


What should be the form of the loss functions $f^+(\theta; \mathcal{C}_h^+)$ and $f^-(\theta; \mathcal{C}_h^-)$? Suppose the points $\mathbf{y}_i$ and
$\mathbf{y}_j$ participate in a must-link constraint. This must-link constraint is violated if the cluster labels
$z_i$ (for $\mathbf{y}_i$) and $z_j$ (for $\mathbf{y}_j$), determined by the MAP rule, are different. Define $\mathbf{z}_i$ to be a vector
of length $k$, such that its $l$-th entry is one if $z_i = l$, and zero otherwise. The number of constraint
violations can be represented by $d(\mathbf{z}_i, \mathbf{z}_j)$ if $d$ is a distance measure such that $d(\mathbf{z}_i, \mathbf{z}_j) = 1$ if $z_i \neq z_j$
and zero otherwise. Similarly, the violation of a must-not-link constraint between $\mathbf{y}_{i^*}$ and $\mathbf{y}_{j^*}$ can
be represented by $1 - d(\mathbf{z}_{i^*}, \mathbf{z}_{j^*})$, where $\mathbf{y}_{i^*}$ and $\mathbf{y}_{j^*}$ are involved in a must-not-link constraint.

Adopting such a distance function $d(\cdot, \cdot)$ as the loss functions $f^+(\cdot)$ and $f^-(\cdot)$ is, however, not
a good idea because $d(\mathbf{z}_i, \mathbf{z}_j)$ is a discontinuous function of $\theta$, due to the presence of argmax in
Equation (5.3). In order to construct an easier optimization problem, we “soften” $\mathbf{z}_i$ and define a
new vector $\mathbf{s}_i$ by

$$s_{il} = \frac{q_{il}^{\tau}}{\sum_{l'=1}^{k} q_{il'}^{\tau}}, \tag{5.12}$$

where $q_{il} = \alpha_l \, p(\mathbf{y}_i \mid \theta_l)$, and $\tau$ is the smoothness parameter. When $\tau$ goes to infinity, $\mathbf{s}_i$ approaches
$\mathbf{z}_i$, whereas a small value of $\tau$ leads to a smooth loss function, which, in general, has a less severe
local optima problem.
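The effect of the smoothness parameter can be seen on a single data point. The following is a minimal sketch, assuming the softened assignment takes the normalized-power form of Equation (5.12); the function name `soft_assignment` and the numerical values of $q_{il}$ are hypothetical and chosen only for illustration.

```python
import math

def soft_assignment(q, tau):
    """Return s_l = q_l**tau / sum_l' q_l'**tau for positive weights q."""
    # Work in the log domain so that a large tau does not overflow or underflow.
    logs = [tau * math.log(v) for v in q]
    m = max(logs)
    w = [math.exp(v - m) for v in logs]
    total = sum(w)
    return [v / total for v in w]

# Hypothetical values of q_il = alpha_l * p(y_i | theta_l) for one data point.
q = [0.30, 0.15, 0.05]
for tau in (0.25, 1.0, 4.0, 50.0):
    # Small tau gives a nearly flat vector; large tau approaches a one-hot vector.
    print(tau, [round(s, 4) for s in soft_assignment(q, tau)])
```

With $\tau = 1$ the vector is the ordinary posterior probability of each cluster; as $\tau$ grows it concentrates on the cluster chosen by the MAP rule.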

Another issue is the choice of the distance function $d(\mathbf{s}_i, \mathbf{s}_j)$. Since $s_{il} \geq 0$ and $\sum_l s_{il} = 1$, $s_{il}$ has
a probabilistic interpretation. A divergence is therefore more appropriate than a common distance
measure such as the Minkowski distance for comparing $\mathbf{s}_i$ and $\mathbf{s}_j$. We adopt the Jensen-Shannon
divergence $D_{JS}(\mathbf{s}_i, \mathbf{s}_j)$ [173] with a uniform class prior as the distance measure:

$$\begin{aligned}
D_{JS}(\mathbf{s}_i, \mathbf{s}_j) &= \frac{1}{2}\left(\sum_{l=1}^{k} s_{il} \log\frac{s_{il}}{t_l} + \sum_{l=1}^{k} s_{jl} \log\frac{s_{jl}}{t_l}\right) \\
&= \frac{1}{2}\left(\sum_{l=1}^{k} s_{il} \log s_{il} + \sum_{l=1}^{k} s_{jl} \log s_{jl}\right) - \sum_{l=1}^{k} t_l \log t_l,
\end{aligned} \tag{5.13}$$

where $t_l = \frac{1}{2}(s_{il} + s_{jl})$.

There are several desirable properties of Jensen-Shannon divergence. It is symmetric, well-defined
for all $\mathbf{s}_i$ and $\mathbf{s}_j$, and its square root can be shown to be a metric [76, 199]. The minimum value
of 0 for $D_{JS}(\cdot, \cdot)$ is attained only when $\mathbf{s}_i = \mathbf{s}_j$. It is upper-bounded by a constant ($\log 2$), and
this happens only when $\mathbf{s}_i$ and $\mathbf{s}_j$ are farthest apart, i.e., when $s_{il} = 1$ and $s_{jh} = 1$ with $l \neq h$.
Because $\frac{1}{\log 2} D_{JS}(\mathbf{z}_i, \mathbf{z}_j) = 1$ if $z_i \neq z_j$ and 0 otherwise, the Jensen-Shannon divergence satisfies
(up to a multiplicative constant) the desirable property of a distance measure as described earlier
in this section. Note that Kullback-Leibler divergence can become unbounded when $\mathbf{s}_i$ and $\mathbf{s}_j$ have
different supports, and thus it is not an appropriate choice.
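These properties are easy to check numerically. The sketch below computes the two-distribution Jensen-Shannon divergence with a uniform prior, using natural logarithms and the convention $0 \log 0 = 0$; the helper names `plogp` and `js_divergence` are illustrative and not taken from the text.

```python
import math

def plogp(p):
    """Return sum_l p_l log p_l, with the convention 0 log 0 = 0."""
    return sum(v * math.log(v) for v in p if v > 0.0)

def js_divergence(s_i, s_j):
    """Jensen-Shannon divergence between two probability vectors (uniform prior)."""
    t = [0.5 * (a + b) for a, b in zip(s_i, s_j)]   # the mixture distribution t_l
    return 0.5 * (plogp(s_i) + plogp(s_j)) - plogp(t)

p = [0.2, 0.3, 0.5]
r = [0.6, 0.3, 0.1]
print(js_divergence(p, p))                     # identical vectors: zero divergence
print(js_divergence(p, r), js_divergence(r, p))  # symmetric in its arguments
print(js_divergence([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]), math.log(2))  # upper bound
```

Because the mixture $t_l$ is positive wherever either input is positive, the value stays finite even when the two vectors have different supports, which is exactly where the Kullback-Leibler divergence breaks down.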

Jensen-Shannon divergence has an additional appealing property: it can be generalized to measure
the difference between more than two distributions. This gives a very natural extension to
constraints at the group level [231, 166]. Suppose $e$ objects participate in the $h$-th group-level must-link
constraint. This is denoted by the variables $a_{hi}$ introduced in Section 5.1.2, where $a_{hi} = 1/e$ if
$\mathbf{y}_i$ participates in this constraint, and zero otherwise. The Jensen-Shannon divergence for the $h$-th
must-link constraint, $D_{JS}^+(h)$, is defined as

$$D_{JS}^+(h) = \sum_{i=1}^{n} a_{hi} \sum_{l=1}^{k} s_{il} \log\frac{s_{il}}{t_{hl}^+} = \sum_{i=1}^{n} a_{hi} \sum_{l=1}^{k} s_{il} \log s_{il} - \sum_{l=1}^{k} t_{hl}^+ \log t_{hl}^+, \tag{5.14}$$

where $t_{hl}^+ = \sum_{i=1}^{n} a_{hi} s_{il}$.

Similarly, the Jensen-Shannon divergence for the $h$-th must-not-link constraint, $D_{JS}^-(h)$, is defined
as

$$D_{JS}^-(h) = \sum_{i=1}^{n} b_{hi} \sum_{l=1}^{k} s_{il} \log\frac{s_{il}}{t_{hl}^-} = \sum_{i=1}^{n} b_{hi} \sum_{l=1}^{k} s_{il} \log s_{il} - \sum_{l=1}^{k} t_{hl}^- \log t_{hl}^-, \tag{5.15}$$

where $t_{hl}^- = \sum_{i=1}^{n} b_{hi} s_{il}$.

Here, $b_{hi}$ denotes the must-not-link constraint as discussed in Section 5.1.2. The proposed objective
function in Equation (5.10) can be rewritten as


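$$\begin{aligned}
J &= \mathcal{L}(\theta; \mathcal{Y}) + \mathcal{F}(\theta; \mathcal{C}) \\
&= \mathcal{L}_{\text{annealed}}(\theta; \mathcal{Y}, \gamma) - \lambda^+ \sum_{h=1}^{m^+} D_{JS}^+(h) + \lambda^- \sum_{h=1}^{m^-} D_{JS}^-(h),
\end{aligned} \tag{5.16}$$

where the annealed log-likelihood $\mathcal{L}_{\text{annealed}}(\theta; \mathcal{Y}, \gamma)$, defined in Equation (B.2), is a generalization
of the log-likelihood intended for deterministic annealing. When $\gamma = 1$, $\mathcal{L}_{\text{annealed}}(\theta; \mathcal{Y}, \gamma)$ equals
$\mathcal{L}(\theta; \mathcal{Y})$. Note that both $D_{JS}^+(h)$ and $D_{JS}^-(h)$ are functions of $\theta$.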
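The constraint part of the objective in Equation (5.16) can be sketched directly from the group-level divergences. The code below is an illustration under stated assumptions: each constraint is given as a list of point indices, participating points receive equal weight $1/e$ (for must-not-link constraints this mirrors the must-link definition and is an assumption, since Section 5.1.2 is not reproduced here), and must-link divergences are penalized while must-not-link divergences are rewarded. All function names are hypothetical.

```python
import math

def plogp(p):
    """Return sum_l p_l log p_l with the convention 0 log 0 = 0."""
    return sum(v * math.log(v) for v in p if v > 0.0)

def group_js(weights, S):
    """Generalized Jensen-Shannon divergence of the rows of S under the given weights."""
    k = len(S[0])
    t = [sum(w * row[l] for w, row in zip(weights, S)) for l in range(k)]
    return sum(w * plogp(row) for w, row in zip(weights, S)) - plogp(t)

def uniform_weights(members, n):
    """Weight 1/e for each of the e participating points, zero for the rest."""
    e = len(members)
    return [1.0 / e if i in members else 0.0 for i in range(n)]

def constraint_term(S, must_link, must_not_link, lam_plus, lam_minus):
    """Return -lam_plus * sum_h D+_JS(h) + lam_minus * sum_h D-_JS(h)."""
    n = len(S)
    d_plus = sum(group_js(uniform_weights(c, n), S) for c in must_link)
    d_minus = sum(group_js(uniform_weights(c, n), S) for c in must_not_link)
    return -lam_plus * d_plus + lam_minus * d_minus

# Hypothetical soft assignments for three points and two clusters.
S = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]
print(constraint_term(S, must_link=[[0, 1]], must_not_link=[[0, 2]],
                      lam_plus=1.0, lam_minus=1.0))
```

The term is large when linked points share similar soft assignments and separated points have dissimilar ones, so maximizing it together with the log-likelihood pushes the cluster parameters toward satisfying the constraints.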

5.4  Optimizing  the  Objective  Function 

The proposed objective function (Equation (5.16)) is more difficult to optimize than the log-likelihood
(Equation (5.2)) used in standard parametric clustering. We cannot derive any efficient
convex relaxation for $J$, meaning that a bound-optimization procedure such as the EM algorithm
cannot  be  applied.  We  resort  to  general  nonlinear  optimization  algorithms  to  optimize  the  objective 
function.  In  Section  5.4.1,  we  shall  present  the  general  idea  of  these  algorithms.  After  describing 
some  details  of  the  algorithms  in  Section  5.4.2,  we  present  the  specific  equations  used  for  a  mixture 
of  Gaussians  in  Section  5.4.3.  Note  that  these  algorithms  are  often  presented  in  the  literature  as 
minimization  algorithms.  Therefore,  we  minimize  —J  rather  than  maximizing  J  in  practice. 

5.4.1  Unconstrained  Optimization  Algorithms 

We have tried several algorithms to optimize the proposed objective function $J$. They include
conjugate gradient, quasi-Newton, preconditioned conjugate gradient, and line-search Newton.
Because these algorithms are fairly well-documented in the literature [87, 23], we shall only describe
their general ideas here. All of these algorithms are iterative and require an initial parameter vector
$\theta^{(0)}$.


5.4.1.1 Nonlinear Conjugate Gradient

The key idea of nonlinear conjugate gradient is to maintain the descent directions $\mathbf{d}^{(t)}$ in different
iterations, so that different $\mathbf{d}^{(t)}$ are orthogonal (conjugate) to each other with respect to some
approximation of the Hessian matrix. This can prevent the inefficient “zig-zag” behavior encountered
in steepest descent, which always uses the negative gradient for descent. Initially, $\mathbf{d}^{(0)}$ equals the
negative gradient of the function to be minimized. At iteration $t$, a line-search is performed along
$\mathbf{d}^{(t)}$, i.e., we seek $\eta$ such that the objective function evaluated at $\theta^{(t)} + \eta \mathbf{d}^{(t)}$ is minimized, where $\theta^{(t)}$
is the current parameter estimate. The parameter is then updated by $\theta^{(t+1)} = \theta^{(t)} + \eta \mathbf{d}^{(t)}$. The
next direction of descent $\mathbf{d}^{(t+1)}$ is found by computing a vector that is (approximately) conjugate
to previous descent directions. Many different schemes have been proposed for this, and we follow
the suggestion given in the tutorial [232] and adopt the Polak-Ribiere method with restarting to
update $\mathbf{d}^{(t+1)}$:

$$\beta^{(t+1)} = \max\left(\frac{(\mathbf{r}^{(t+1)})^T (\mathbf{r}^{(t+1)} - \mathbf{r}^{(t)})}{(\mathbf{r}^{(t)})^T \mathbf{r}^{(t)}},\; 0\right),$$

$$\mathbf{d}^{(t+1)} = \mathbf{r}^{(t+1)} + \beta^{(t+1)} \mathbf{d}^{(t)},$$

where $\mathbf{r}^{(t)}$ denotes the negative gradient evaluated at $\theta^{(t)}$.


Note that line-search in conjugate gradient should be reasonably accurate, in order to ensure that
the search directions $\mathbf{d}^{(t)}$ are indeed approximately conjugate (see the discussion in Chapter 7 in
[23]).

The main strength of conjugate gradient is that its memory usage is only linear with respect to
the number of variables, thereby making it attractive for large scale problems. Conjugate gradient
has also found empirical success in fitting a mixture of Gaussians [222], and is shown to be more
efficient than the EM algorithm when the clusters are highly overlapping.
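The iteration can be sketched in a few lines. The following is a toy illustration rather than the implementation used in the experiments: it assumes the Polak-Ribiere coefficient is clipped at zero to restart, it replaces the cubic-interpolation line search with a simple golden-section search (adequate only for a function that is unimodal along each search direction), and the objective `toy_objective` is a hypothetical convex function introduced for the example.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def line_search(f, theta, d, iters=60):
    """Golden-section search for eta >= 0 minimizing f(theta + eta * d)."""
    phi = lambda eta: f([t + eta * di for t, di in zip(theta, d)])
    hi = 1.0
    for _ in range(50):                 # grow the bracket while f keeps decreasing
        if phi(2.0 * hi) >= phi(hi):
            break
        hi *= 2.0
    lo, hi = 0.0, 2.0 * hi
    g = (math.sqrt(5.0) - 1.0) / 2.0
    for _ in range(iters):
        a = hi - g * (hi - lo)
        b = lo + g * (hi - lo)
        if phi(a) < phi(b):
            hi = b
        else:
            lo = a
    return 0.5 * (lo + hi)

def conjugate_gradient(f, grad, theta, max_iter=100, tol=1e-6):
    r = [-gi for gi in grad(theta)]     # r is the negative gradient
    d = list(r)                         # the first direction is steepest descent
    for _ in range(max_iter):
        if math.sqrt(dot(r, r)) < tol:
            break
        eta = line_search(f, theta, d)
        theta = [t + eta * di for t, di in zip(theta, d)]
        r_new = [-gi for gi in grad(theta)]
        diff = [a - b for a, b in zip(r_new, r)]
        beta = max(dot(r_new, diff) / dot(r, r), 0.0)   # Polak-Ribiere with restart
        d = [rn + beta * di for rn, di in zip(r_new, d)]
        r = r_new
    return theta

def toy_objective(th):
    return (th[0] - 1.0) ** 2 + 10.0 * (th[1] + 2.0) ** 2 + (th[0] - 1.0) ** 4

def toy_gradient(th):
    return [2.0 * (th[0] - 1.0) + 4.0 * (th[0] - 1.0) ** 3, 20.0 * (th[1] + 2.0)]

print(conjugate_gradient(toy_objective, toy_gradient, [-3.0, 4.0]))
```

Only a handful of vectors of the same length as $\theta$ are stored, which is the linear memory footprint mentioned above.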


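5.4.1.2 Quasi-Newton

Consider the second-order Taylor expansion for a real-valued function $f(\mathbf{x})$, which is

$$f(\theta) \approx f(\theta^{(t)}) + (\theta - \theta^{(t)})^T \mathbf{g}(\theta^{(t)}) + \frac{1}{2} (\theta - \theta^{(t)})^T H(\theta^{(t)}) (\theta - \theta^{(t)}), \tag{5.17}$$

where $\mathbf{g}(\mathbf{x}_0)$ and $H(\mathbf{x}_0)$ denote the gradient and the Hessian of the function $f(\cdot)$ evaluated with
$\mathbf{x} = \mathbf{x}_0$. For brevity, we shall drop the reference to $\theta^{(t)}$ for both $\mathbf{g}$ and $H$. Assuming that $H$ is positive
definite, the right-hand-side of the above approximation can be minimized by $\theta = \theta^{(t)} - H^{-1}\mathbf{g}$.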
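The minimizer follows from setting the gradient of the quadratic model to zero. In the short derivation below, $m(\theta)$ is shorthand introduced here for the right-hand side of Equation (5.17), and $H$ is assumed symmetric and positive definite so that the stationary point is the unique minimum.

```latex
\begin{aligned}
m(\theta) &= f(\theta^{(t)}) + (\theta-\theta^{(t)})^T \mathbf{g}
            + \tfrac{1}{2}\,(\theta-\theta^{(t)})^T H\,(\theta-\theta^{(t)}), \\
\nabla_\theta\, m(\theta) &= \mathbf{g} + H\,(\theta-\theta^{(t)}) = \mathbf{0}
\;\;\Longrightarrow\;\; \theta = \theta^{(t)} - H^{-1}\mathbf{g}.
\end{aligned}
```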

The quasi-Newton algorithm does not require explicit knowledge of the Hessian $H$, which can
sometimes be tricky to obtain. Instead, it maintains an approximate Hessian $\tilde{H}$, which should satisfy
the quasi-Newton condition:

$$\theta^{(t+1)} - \theta^{(t)} = \tilde{H}^{-1} (\mathbf{g}^{(t+1)} - \mathbf{g}^{(t)}).$$

Since the inversion of the Hessian can be computationally expensive, $G^{(t)}$, the inverse of the Hessian,
is approximated instead. While different schemes to update $G^{(t)}$ are possible, the de facto standard
is the BFGS (Broyden-Fletcher-Goldfarb-Shanno) procedure. Below is its description taken from
[23]:

$$\mathbf{p} = \theta^{(t+1)} - \theta^{(t)}$$

$$\mathbf{v} = \mathbf{g}^{(t+1)} - \mathbf{g}^{(t)}$$

Given that $G^{(t)}$ is positive-definite and the round-off error is negligible, the above update guarantees
that $G^{(t+1)}$ is positive-definite. The initial value of the approximated inverse Hessian $G^{(0)}$ is often
set to the identity matrix. Note that an alternative approach to implement quasi-Newton is to
maintain the Cholesky decomposition of the approximated Hessian instead. This has the advantage
that the approximated Hessian is guaranteed to be positive definite even when the round-off error
cannot be ignored.
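For concreteness, here is a sketch of one BFGS update of the approximate inverse Hessian. The formula coded below is the standard textbook form of the BFGS inverse update, stated as an assumption rather than as a transcription of this chapter; it takes $\mathbf{p}$ to be the parameter step and $\mathbf{v}$ to be the change in the gradient of the function being minimized. The function names and the quadratic test problem are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(G, x):
    return [dot(row, x) for row in G]

def bfgs_inverse_update(G, p, v):
    """One BFGS update of the approximate inverse Hessian G (assumed symmetric).

    p is the parameter step and v the gradient difference; p^T v > 0 is needed
    for the updated matrix to remain positive definite.
    """
    Gv = matvec(G, v)
    pv = dot(p, v)
    vGv = dot(v, Gv)
    u = [pi / pv - gi / vGv for pi, gi in zip(p, Gv)]
    n = len(p)
    return [[G[i][j] + p[i] * p[j] / pv - Gv[i] * Gv[j] / vGv + vGv * u[i] * u[j]
             for j in range(n)] for i in range(n)]

# Toy check on f(theta) = 0.5 * theta^T A theta, whose gradient is A theta.
A = [[3.0, 1.0], [1.0, 2.0]]
theta_old, theta_new = [0.0, 0.0], [1.0, -0.5]
p = [a - b for a, b in zip(theta_new, theta_old)]
v = [a - b for a, b in zip(matvec(A, theta_new), matvec(A, theta_old))]
G = [[1.0, 0.0], [0.0, 1.0]]            # identity as the initial inverse Hessian
G_new = bfgs_inverse_update(G, p, v)
print(G_new)
print(matvec(G_new, v), p)              # the update maps v back onto p
```

The last line illustrates the quasi-Newton condition: after the update, multiplying the gradient difference by the new inverse-Hessian estimate reproduces the parameter step.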

In practice, the quasi-Newton algorithm is accompanied by a line-search procedure to cope
with the error in the Taylor approximation in Equation (5.17) when $\theta$ is far away from $\theta^{(t)}$. The
descent direction used is $-\tilde{H}^{-1}\mathbf{g}^{(t)}$. Note that if $\tilde{H}$ is positive definite, $-(\mathbf{g}^{(t)})^T \tilde{H}^{-1} \mathbf{g}^{(t)}$ will
always be negative and $-\tilde{H}^{-1}\mathbf{g}^{(t)}$ will be a valid descent direction.

The main drawback of the quasi-Newton method is its memory requirement. The approximate
inverse Hessian requires $O(|\theta|^2)$ memory, where $|\theta|$ is the number of variables in $\theta$. This can be slow
for high-dimensional $\theta$, which is the case when the data $\mathbf{y}_i$ is of high dimensionality.


5.4.1.3 Preconditioned Conjugate Gradient

Both conjugate gradient and quasi-Newton require only the gradient information of the function
to be minimized. Faster convergence is possible if we incorporate the analytic form of the Hessian
matrix into the optimization procedure. However, what really can help is not the Hessian, but the
inverse of the Hessian. Since the inversion of the Hessian can be slow, it is common to adopt some
approximation of the Hessian matrix so that its inversion can be done quickly.

Preconditioned conjugate gradient (PCG) uses an approximation to the inverse Hessian to speed
up conjugate gradient. The approximation, also known as the preconditioner, is denoted by $M$.
PCG essentially creates an optimization problem that has $M^{-1/2} H M^{-1/2}$ as the “effective” Hessian
matrix and applies conjugate gradient to it, where $H$ is the Hessian matrix of the original
optimization problem. If the “effective” Hessian matrix is close to the identity, conjugate gradient
can converge very fast.

We refer the reader to the appendix in [232] for the exact algorithm for PCG. Practical implementation
of PCG does not require the computation of $M^{-1}$. Only the multiplication by $M^{-1}$ is
needed. Note that the preconditioner should be positive definite, or the descent direction computed
may not decrease the objective function. We can see that there are three requirements for a good
preconditioner: positive definiteness, efficient inversion, and good approximation of the Hessian. The first
and the third requirements can conflict with each other, because the true Hessian is often not
positive-definite unless the objective function is convex. Finding a good preconditioner is an art,
and often requires insights into the problem at hand. However, general procedures for creating a
preconditioner also exist, which can be based on incomplete Cholesky factorization, for example.

5.4.1.4 Line-search Newton

Line-search Newton is almost the same as the quasi-Newton algorithm, except that the Hessian is
provided by the user instead of being approximated by the gradients. There is, however, a catch
here. The true Hessian may not be positive-definite, meaning that the minimization problem on
the right-hand-side of Equation (5.17) does not have a solution. Therefore, it is common to replace
the true Hessian with some approximated version that is positive-definite. Since $H^{-1}\mathbf{g}$ is to be
computed, such an approximation should admit efficient inversion, or at least multiplication by its
inverse should be fast. There are two possible ways to obtain such an approximation. We can either
add $\epsilon I$ to the true Hessian, where $\epsilon$ is some positive number determined empirically, or we can
“repair” $H$ by adding some terms to it to convert it to a positive-definite matrix.

Note that for both line-search Newton and PCG, the approximated inverse of the Hessian, which
takes $O(|\theta|^2)$ memory, need not be formed explicitly. The only thing needed is the ability to
multiply by the approximated inverse.


5.4.2  Algorithm  Details 

There  are  several  issues  that  are  common  to  all  these  optimization  algorithms. 


5.4.2.1 Constraints on the Parameters


The algorithms described in Section 5.4.1 are all unconstrained optimization algorithms, meaning
that there are no restrictions on the values of $\theta$. However, our optimization problem contains the
constraint that the mixture weights $\alpha_j$ are positive and sum to one, and the fact that the precision
matrix $\Upsilon_j$ is symmetric and positive definite. For $\{\alpha_j\}$, we re-parameterize by introducing a set of
variables $\{\beta_j\}$ and set

$$\alpha_j = \frac{\exp(\beta_j)}{\sum_{j'=1}^{k} \exp(\beta_{j'})}. \tag{5.18}$$

For $\Upsilon_j$, we can either re-parameterize by introducing $F_j$ such that $\Upsilon_j = F_j F_j^T$, or we can modify
our optimization algorithm to cope with the constraints. In the latter case, the positive-definite constraint is enforced
by modifying the line-search routine so that the parameters are always feasible. This is a workable
approach because the precision matrices in a reasonable clustering solution should not become near
singular. The symmetry constraint is enforced by requiring that the descent direction in
line-search always has symmetric precision matrices.
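Both re-parameterizations turn constrained quantities into functions of unconstrained ones, which is what allows a general-purpose optimizer to be used. A minimal sketch, with hypothetical function names and example values:

```python
import math

def mixture_weights(beta):
    """Softmax map: any real-valued beta gives positive weights that sum to one."""
    m = max(beta)                        # subtract the max for numerical stability
    w = [math.exp(b - m) for b in beta]
    total = sum(w)
    return [v / total for v in w]

def precision_from_factor(F):
    """Return F F^T, which is symmetric and positive semidefinite by construction."""
    n = len(F)
    return [[sum(F[i][k] * F[j][k] for k in range(len(F[0]))) for j in range(n)]
            for i in range(n)]

beta = [-1.0, 0.0, 2.5]                  # hypothetical unconstrained values
F = [[1.0, 0.0], [2.0, -0.5]]            # hypothetical nonsingular factor
alpha = mixture_weights(beta)
upsilon = precision_from_factor(F)
print(alpha, sum(alpha))
print(upsilon)
```

The product $F F^T$ is positive definite whenever $F$ is nonsingular, so the optimizer can move freely in $\beta$ and $F$ without ever leaving the feasible set.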


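5.4.2.2 Common Precision Matrix

A common practice when fitting a mixture of Gaussians is to assume a common precision matrix,
i.e., the precision matrices of all the $k$ Gaussian components are restricted to be the same:
$\Upsilon_1 = \cdots = \Upsilon_k = \Upsilon$. Instead of the gradient with respect to different $\Upsilon_j$, we need the gradient of
$J$ with respect to $\Upsilon$. This can be done easily because

$$\frac{\partial J}{\partial \Upsilon} = \sum_{j=1}^{k} \frac{\partial J}{\partial \Upsilon_j} \frac{\partial \Upsilon_j}{\partial \Upsilon} = \sum_{j=1}^{k} \frac{\partial J}{\partial \Upsilon_j}.$$

Consequently, Equation (B.12) should be modified to

$$\frac{\partial J}{\partial \Upsilon} = -\frac{1}{2} \sum_{ij} c_{ij} \mathbf{y}_i \mathbf{y}_i^T + \frac{1}{2} \sum_{j} \left(\mu_j \mu_j^T + \Upsilon^{-1}\right) \sum_{i} c_{ij}, \tag{5.19}$$

whereas Equation (B.20) should be modified to

$$\frac{\partial J}{\partial \Upsilon} = \frac{\Upsilon^{-1}}{2} \sum_{ij} c_{ij} - \frac{1}{2} \sum_{ij} c_{ij} (\mathbf{y}_i - \mu_j)(\mathbf{y}_i - \mu_j)^T. \tag{5.20}$$

The case for Cholesky parameterization is similar. We set $F_1 = \cdots = F_k = F$, and Equation (B.15)
should be modified to

$$\frac{\partial J}{\partial F} = -\sum_{ij} c_{ij} \mathbf{y}_i \mathbf{y}_i^T F + \sum_{j} \left(\mu_j \mu_j^T + \Upsilon^{-1}\right) F \sum_{i} c_{ij}, \tag{5.21}$$

and Equation (B.21) should be modified to

$$\frac{\partial J}{\partial F} = \Upsilon^{-1} F \sum_{ij} c_{ij} - \sum_{ij} c_{ij} (\mathbf{y}_i - \mu_j)(\mathbf{y}_i - \mu_j)^T F. \tag{5.22}$$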

5.4.2.3 Line Search Algorithm

The line-search algorithm we used is based on the implementation in Matlab, which is in turn based
on Section 2.6 in [86]. Its basic idea is to perform a cubic interpolation based on the value of the
function and the gradient evaluated at two parameter values. The line search terminates when the
Wolfe conditions are satisfied. Following the advice in Chapter 7 of [23], the line-search is stricter for
both conjugate gradient and preconditioned conjugate gradient in order to ensure conjugacy. Note
that when the Gaussians are parameterized by their precision matrices, the line search procedure
disallows any parameter vector that has non-positive definite precision matrices.

5.4.2.4 Annealing the Objective Function

The algorithms described in Section 5.4.1 find only the local minima of $J$ based on the initial
parameter estimate $\theta^{(0)}$. One strategy to alleviate this problem is to adopt a deterministic-annealing
type of procedure and use a “smoother” version of the objective function. The solution of this
“smoother” optimization problem is used as the initial guess for the actual objective function to
be optimized. Specifically, we adjust the two temperature-like parameters $\gamma$ and $\tau$ in $J$ defined in
Equation (5.16). When $\gamma$ and $\tau$ are small, $J$ is smooth and almost convex, and is therefore easy
to optimize. The annealing stops when $\gamma$ reaches one and $\tau$ reaches a pre-specified value $\tau_{\text{final}}$,
which is set to four in our experiments. This is, however, a fairly insensitive parameter. Any number
between one and sixteen leads to similar clustering results.


5.4.3  Specifics  for  a  Mixture  of  Gaussians 

All the algorithms described in Section 5.4.1 require the gradient information of the objective function.
In Appendix B.1, we have derived the gradient information with the assumption that each
mixture component is a Gaussian distribution. Recall that $q_{ij} = \log p(\mathbf{y}_i \mid \theta_j)$, and $s_{ij}$ has been
defined in Equation (5.12). Define the following:
The partial derivative of $J$ with respect to $\beta_l$ is
Under the natural parameterization $\nu_l$ and $\Upsilon_l$ for the parameters of the $l$-th cluster, we have
If the Cholesky parameterization is used instead of $\Upsilon_l$, we have
If the moment parameterization $\mu_l$ and $\Upsilon_l$ are used instead, we have
and the corresponding partial derivative if Cholesky parameterization is used is
The Hessian of $J$ is clumsier to present, and the reader can refer to Appendix B.2 for its exact form
under various parameterizations.


5.5  Feature  Extraction  and  Clustering  with  Constraints 

It  turns  out  that  the  objective  function  introduced  in  Section  5.3  can  be  modified  to  simultaneously 
perform  feature  extraction  and  clustering  with  constraints.  There  are  three  reasons  why  we  are 
interested  in  performing  these  two  tasks  together. 

First,  the  proposed  algorithm  does  not  perform  well  for  small  data  sets  with  a  large  number 
of  features  (denoted  by  d),  because  the  d  by  d  covariance  matrix  is  estimated  from  the  available 
data.  In  other  words,  we  are  suffering  from  the  curse  of  dimensionality.  The  standard  solution 
is  to  preprocess  the  data  by  reducing  the  dimensionality  using  methods  like  principal  component 
analysis. However, the resulting low-dimensional representation may not be optimal for clustering
with  the  given  set  of  constraints.  It  is  desirable  to  incorporate  the  constraints  in  seeking  a  good 
low-dimensional  representation. 

The  second  reason  is  from  a  modeling  perspective.  One  can  argue  that  it  is  inappropriate  to  model 
the two desired clusters shown in Figure 5.3(d) by two Gaussians, because the distribution of the data
points is very “non-Gaussian”: there are no data points in the central regions of the two Gaussians,
which are supposed to have the highest data densities! If the data points are projected to the
one-dimensional subspace of the x-axis, the resulting two clusters follow the Gaussian assumption well
while satisfying the constraints. Note that PCA selects a projection direction that is predominantly
based  on  the  y-axis  because  the  data  variance  in  that  direction  is  large.  However,  the  clusters  formed 
after  such  a  projection  will  violate  the  constraints.  In  general,  it  is  quite  possible  that  given  a  high 
dimensional  data  set,  there  exists  a  low-dimensional  subspace  such  that  the  clusters  after  projection 
are  Gaussian-like,  and  the  constraints  are  satisfied  by  those  clusters. 

The  third  reason  is  that  the  projection  can  be  combined  with  the  kernel  trick  to  achieve  clusters 
with  arbitrary  shapes.  A  nonlinear  transformation  is  applied  to  the  data  set  to  embed  the  data  in  a 
high-dimensional  feature  space.  A  linear  subspace  of  the  given  feature  space  is  sought  such  that  the 
Gaussian  clusters  formed  in  that  subspace  are  consistent  with  the  given  set  of  constraints.  Because 
of  the  non-linear  transformation,  linear  cluster  boundaries  in  that  subspace  correspond  to  nonlinear 
boundaries  in  the  original  input  space.  The  exact  form  of  the  nonlinear  boundaries  is  controlled 
by  the  type  of  the  nonlinear  transformation  applied.  Note  that  such  transformation  need  not  be 
performed  explicitly  because  of  the  kernel  trick  (see  Section  2.5.1).  In  practice,  kernel  PCA  is  first 
performed  on  the  data  in  order  to  extract  the  main  structure  in  the  high  dimensional  feature  space. 
The number of features returned by kernel PCA should be large. The feature extraction algorithm
in this section is then applied to the result of kernel PCA.


5.5.1  The  Algorithm 

Let $\mathbf{x}_i$ be the result of projecting the data point $\mathbf{y}_i$ into a $d'$-dimensional space, where $d'$ is small and
$d' < d$, and $d$ is the dimension of $\mathbf{y}_i$. Let $P$ be the $d$ by $d'$ projection matrix, i.e., $\mathbf{x}_i = P^T \mathbf{y}_i$ and
$P^T P = I$. Let $P^T \mu_j$ and $\Upsilon$ be the cluster center of the $j$-th Gaussian component and the common
precision matrix, respectively. Let $R$ be the Cholesky decomposition of $\Upsilon$, i.e., $\Upsilon = R R^T$. We
have

$$p(\mathbf{x}_i \mid z_i = j) = (2\pi)^{-d'/2} (\det \Upsilon)^{1/2} \exp\left(-\frac{1}{2} (\mathbf{x}_i - P^T \mu_j)^T \Upsilon (\mathbf{x}_i - P^T \mu_j)\right). \tag{5.30}$$

Because $\Upsilon = R R^T$ and $P^T P = I$, we can rewrite the above as

$$\log p(\mathbf{x}_i \mid z_i = j) = -\frac{d'}{2} \log(2\pi) + \frac{1}{2} \log\det F^T F - \frac{1}{2} (\mathbf{y}_i - \mu_j)^T F F^T (\mathbf{y}_i - \mu_j), \tag{5.31}$$

where $F = PR$. Note the similarity between this expression and that of $\log p(\mathbf{y}_i \mid z_i = j)$ if we adopt
the parameterization $\Upsilon = F F^T$ as discussed in Section 5.4.2.1. We have

$$\frac{\partial}{\partial \mu_j} \log p(\mathbf{x}_i \mid z_i = j) = F F^T (\mathbf{y}_i - \mu_j), \tag{5.32}$$

$$\frac{\partial}{\partial F} \log p(\mathbf{x}_i \mid z_i = j) = F (F^T F)^{-1} - (\mathbf{y}_i - \mu_j)(\mathbf{y}_i - \mu_j)^T F. \tag{5.33}$$

While $P$ has an orthogonality constraint, there is no constraint on $F$, and thus we cast our optimization
problem in terms of $F$. The parameters $F$, $\mu_j$ and $\beta_j$ can be found by optimizing $J$,
after substituting Equation (5.31) as $\log q_{ij}$ into Equation (5.16). In practice, the quasi-Newton
algorithm is used to find the parameters that minimize the objective function, because it is difficult
to invert the Hessian efficiently.
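The equivalence between the density written in the projected space and its form in terms of $F = PR$ can be checked numerically. The sketch below assumes $d = 3$ and $d' = 2$, treats the common matrix in Equation (5.30) as a precision matrix with factor $R$, and uses hand-picked values for $P$, $R$, the data point and the cluster center; every name and number in it is illustrative.

```python
import math

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def det2(A):
    return A[0][0] * A[1][1] - A[0][1] * A[1][0]

def quad(v, A):
    """Return the quadratic form v^T A v."""
    n = len(v)
    return sum(v[i] * A[i][j] * v[j] for i in range(n) for j in range(n))

def logp_projected(y, mu, P, R):
    """Log-density evaluated in the projected space with precision R R^T."""
    diff = [a - b for a, b in zip(y, mu)]
    x_diff = [sum(P[i][a] * diff[i] for i in range(len(diff))) for a in range(len(P[0]))]
    U = matmul(R, transpose(R))
    dp = len(R)
    return -0.5 * dp * math.log(2.0 * math.pi) + 0.5 * math.log(det2(U)) - 0.5 * quad(x_diff, U)

def logp_via_F(y, mu, F):
    """The same log-density written in terms of F only."""
    diff = [a - b for a, b in zip(y, mu)]
    Ft = transpose(F)
    dp = len(F[0])
    return (-0.5 * dp * math.log(2.0 * math.pi)
            + 0.5 * math.log(det2(matmul(Ft, F)))
            - 0.5 * quad(diff, matmul(F, Ft)))

c = 1.0 / math.sqrt(2.0)
P = [[1.0, 0.0], [0.0, c], [0.0, c]]     # 3 x 2 with orthonormal columns
R = [[2.0, 0.0], [0.5, 1.5]]             # lower-triangular factor
y, mu = [0.3, -1.2, 0.7], [1.0, 0.5, -0.4]
F = matmul(P, R)
print(logp_projected(y, mu, P, R), logp_via_F(y, mu, F))
```

The two values agree because $F^T F = R^T P^T P R = R^T R$ has the same determinant as $R R^T$, and the quadratic form in $\mathbf{y}_i - \mu_j$ with $F F^T$ equals the quadratic form in the projected difference with $R R^T$.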

It is interesting to point out that this subspace learning procedure is related to linear discriminant
analysis if the data points $\mathbf{y}_i$ are standardized to have equal variance. If we fix $\Upsilon$ to be the identity
matrix, maximizing the log-likelihood is the same as minimizing $(\mathbf{y}_i - \mu_j)^T P P^T (\mathbf{y}_i - \mu_j)$. This is
the within-class scatter of the $j$-th cluster. Since the sum of between-class scatter and the within-class
scatter is the total data scatter, which is constant because of the standardization, minimization
of the within-class scatter is the same as maximizing the ratio of between-class scatter to the within-class
scatter. This is what linear discriminant analysis does.
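The argument can be spelled out with the usual scatter matrices. The symbols $S_T$, $S_W$, $S_B$ and the constant $c$ are introduced here for illustration only; the sketch assumes that the standardization makes the total scatter isotropic, $S_T = cI$, so that the projected total scatter does not depend on $P$.

```latex
\begin{aligned}
S_T &= S_W + S_B, \\
\operatorname{tr}(P^T S_T P) &= \operatorname{tr}(P^T S_W P) + \operatorname{tr}(P^T S_B P)
  = c\, d' \qquad (S_T = cI,\; P^T P = I), \\
\min_P \operatorname{tr}(P^T S_W P) \;&\Longleftrightarrow\; \max_P \operatorname{tr}(P^T S_B P)
  \;\Longleftrightarrow\; \max_P \frac{\operatorname{tr}(P^T S_B P)}{\operatorname{tr}(P^T S_W P)}.
\end{aligned}
```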


5.6  Experiments 

To  verify  the  effectiveness  of  the  proposed  approach,  we  have  applied  our  algorithm  to  both  synthetic 
and  real  world  data  sets.  We  compare  the  proposed  algorithm  with  two  state-of-the-art  algorithms 
for  clustering  under  constraints.  The  first  one,  denoted  by  Shental,  is  the  algorithm  proposed  by 
Shental  et  al.  in  [231].  It  uses  “chunklets”  to  represent  the  cluster  labels  of  the  objects  involved  in 
must-link constraints, and a Markov network to represent the cluster labels of objects participating
in must-not-link constraints. The EM algorithm is used for parameter estimation, and the E-step
is done by computations within the Markov network. The paper does not make clear the precise
algorithm used for the inference in the E-step, though the Matlab implementation provided by
the authors seems to use the junction tree algorithm. This can take exponential time when the
constraints are highly coupled. This potential high time complexity is the motivation for the
mean-field approximation used in the E-step of [161]. The second algorithm, denoted by Basu, is the
constrained k-means algorithm with metric learning described in [21]. It is based on the idea of
hidden  Markov  random  field,  and  it  uses  the  constraints  to  adjust  the  metrics  between  different  data 
points.  A  parameter  is  needed  for  the  strength  of  the  constraints.  Note  that  we  do  not  compare  our 
approach  with  the  algorithm  in  [161],  because  its  implementation  is  no  longer  available. 

5.6.1  Experimental  Result  on  Synthetic  Data 

Our  first  experiment  is  based  on  the  example  in  Figure  5.3(a),  which  contains  400  points  generated 
by  four  Gaussians  centered  at  g  ,  _g  ,  _g  and  gZ  ,  each  with  identity  covariance  matrix. 
Recall  that  the  goal  is  to  group  this  data  set  into  two  clusters  -  a  “left”  and  a  “right”  cluster  - 
based  on  the  two  must-link  constraints.  Specifically,  points  with  negative  and  positive  horizontal 
co-ordinates  are  intended  to  be  in  two  different  clusters.  Note  that  this  synthetic  example  differs 
from  the  similar  one  in  [161]  in  that  the  vertical  separation  between  the  top  and  bottom  point  clouds 
is  larger.  This  increases  the  difference  between  the  goodness  of  the  “left/right”  and  “top/bottom” 
clustering  solutions,  so  that  a  small  number  of  constraints  is  no  longer  powerful  enough  to  bias 
one  clustering  solution  over  the  other  as  in  [161].  The  results  of  running  the  algorithms  Shental 
and  Basu  are  shown  in  Figures  5.4(a)  and  5.4(b),  respectively.  For  Shental  the  two  Gaussians 
estimated  are  also  shown.  Not  only  did  both  algorithms  fail  to  recover  the  desired  cluster  structure, 
but  also  the  cluster  assignments  found  were  counter-intuitive.  This  failure  is  due  to  the  fact  that 
these  two  approaches  represent  the  constraints  by  imposing  prior  distributions  on  the  cluster  labels, 
as  explained  earlier  in  Section  5.2. 

The result of applying the proposed algorithm to this data set with λ = 250 is shown in
Figure 5.4(c). The two desired clusters have been almost perfectly recovered, as can be seen by
comparing the solution visually with the desired cluster structure in Figure 5.3(c). A more careful
comparison is done in Figure 5.4(d), where the cluster boundary obtained by the proposed algorithm
(the gray dotted line) is compared with the ground-truth boundary (the solid green line). We can see
that these two boundaries are very close to each other, indicating that the proposed algorithm
discovered a good cluster boundary. This contrasts with the similar example in [167], where the
cluster boundary (as inferred from the Gaussians shown) is quite different⁵ from the desired cluster
boundary. An additional cluster boundary obtained by the proposed algorithm when τ took the
intermediate value of 1 is also shown (the magenta dashed line). (The final cluster boundary was
produced with τ = 4.) This intermediate boundary is significantly different from the ground-truth
boundary. So, a large value of τ improves the clustering result in this case. This improvement arises
because a large τ focuses on the cluster assignments of the objects and reduces the spurious influence
of the exact locations of the points. The Jensen-Shannon divergence measures the constraint
violation/satisfaction more accurately. Note that increasing τ further does not have any visible
effect on the cluster boundary.

³ The URL is http://www.cs.huji.ac.il/-tomboy/code/ConstrainedEM_plusBNT.zip.

⁴ Its implementation is available at http://www.cs.utexas.edu/users/ml/risc/code/.

⁵ Note that the synthetic data example in [167] is fitted with a mixture model with different covariance matrices
per class. Therefore, comparing it with the proposed algorithm may not be entirely fair.
The Gaussian distributions contributing to these cluster boundaries are shown in Figure 5.4(e).
We  observe  that  the  Gaussians  recovered  by  the  proposed  algorithm  (dotted  gray  lines)  are  slightly 
“fatter”  than  those  obtained  with  the  ground-truth  labels  (solid  green  lines).  This  is  because  data 
points  not  in  a  particular  cluster  can  still  contribute,  though  to  a  smaller  extent,  to  the  covariance 
of  the  Gaussian  distributions  due  to  the  soft-assignment  implied  in  the  mixture  model.  This  is  not 
the  case  when  the  covariance  matrix  is  estimated  based  on  the  ground-truth  labels. 

While the proposed algorithm is the only clustering under constraints algorithm we know of that
can return the two desired clusters, we note that a sufficiently large λ is needed for its
success. With λ = 50, for example, the proposed algorithm produces the result shown in Figure 5.4(f).
This is virtually identical to the clustering solution without any constraints (Figure 5.3(b)). While
the constraints are violated, the clustering solution is more “reasonable” than the solutions shown in
Figures 5.4(a) and 5.4(b). Note that it is easy to detect that λ is too small in this example, because
the constraints are violated. We should increase λ until this is no longer the case. The resulting
clustering solution will effectively be identical to the desired solution shown in Figure 5.4(c).

5.6.2  Experimental  Results  on  Real  World  Data 

We  have  also  compared  the  proposed  algorithm  with  the  algorithms  Shental  and  Basu  based  on  real 
world  data  sets  obtained  from  different  domains.  The  label  information  in  these  data  sets  is  used 
only  for  the  creation  of  the  constraints  and  for  performance  evaluation.  In  particular,  the  labels  are 
not  used  by  the  clustering  algorithms. 

5.6.2.1 Data Sets Used

Table 5.2 summarizes the characteristics of the data sets used. The following preprocessing has
been applied to the data whenever necessary. If a data set has a nominal feature that can assume
c possible values with c > 2, that feature is converted into c continuous features. The i-th such
feature is set to one when the nominal feature assumes the i-th possible value, and the remaining
c − 1 continuous features are set to zero. If the variances of the features of a data set are very
different, standardization is applied to all the features, so that the variances or the ranges of the
preprocessed features become the same. If the number of features is too large when compared
with the number of data points n, principal component analysis (PCA) is applied to reduce the
dimensionality. The reduced dimensionality d is determined by finding the largest d that
satisfies n > 3d², while the principal components with negligible eigenvalues are also discarded. The
difficulty of the classification tasks associated with these data sets can be seen from the values of the
F-score and the normalized mutual information (to be defined in Section 5.6.2.3) computed using
the ground-truth labels, under the assumption that the class conditional densities are Gaussian with
common covariance matrices.
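The preprocessing steps described above can be sketched in a few lines of NumPy. This is only an illustrative sketch and not the code used in the experiments: the function names (`one_hot`, `standardize`, `largest_dim`, `pca_reduce`), the eigenvalue tolerance `eig_tol`, and the reading of the dimensionality rule as n > 3d² are assumptions made for this example.

```python
import numpy as np

def one_hot(column, values):
    """Convert one nominal column into len(values) continuous 0/1 columns."""
    column = np.asarray(column)
    return np.stack([(column == v).astype(float) for v in values], axis=1)

def standardize(X, target_std=1.0):
    """Rescale every column to zero mean and a common standard deviation."""
    X = np.asarray(X, dtype=float)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # constant columns simply become all zeros
    return (X - X.mean(axis=0)) / std * target_std

def largest_dim(n, max_dim):
    """Largest d <= max_dim with n > 3 * d**2 (assumed form of the rule)."""
    d = 0
    while d < max_dim and n > 3 * (d + 1) ** 2:
        d += 1
    return d

def pca_reduce(X, eig_tol=1e-10):
    """Project X onto its leading principal components."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)
    # Squared singular values of the centered data are proportional
    # to the eigenvalues of the sample covariance matrix.
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    eig = s ** 2 / max(len(X) - 1, 1)
    keep = int(np.sum(eig > eig_tol * eig[0]))  # drop negligible components
    d = largest_dim(len(X), keep)
    return Xc @ Vt[:d].T
```

Under this reading, `largest_dim(366, 33)` gives 11 and `largest_dim(351, 34)` gives 10, which agrees with the dimensionalities reported for derm and ion.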

Data Sets from UCI  The following data sets are obtained from the UCI machine learning
repository⁶. The list below includes most of the data sets in the repository that have mostly
continuous features and have relatively balanced class sizes.

The  dermatology  database  (derm)  contains  366  cases  with  34  features.  The  goal  is  to  determine 
the  type  of  Erythemato-Squamous  disease  based  on  the  features  extracted.  The  age  attribute,  which 
has  missing  values,  is  removed.  PCA  is  performed  to  reduce  the  resulting  33  dimensional  data  to  11 
features.  The  sizes  of  the  six  classes  are  112,  61,  72,  49,  52  and  20. 


⁶ The URL is http://www.ics.uci.edu/~mlearn/MLRepository.html
The optical recognition of handwritten digits data set (digits) is based on normalized bitmaps
of handwritten digits extracted from a preprinted form. The 32x32 bitmaps are divided into
non-overlapping blocks of 4x4, and the number of pixels is counted in each block. Thus 64 features are
obtained for each digit. The training and testing sets are combined, leading to 5620 patterns. PCA
is applied to reduce the dimensionality to 42 to preserve 99% of the total variance. The sizes of the
ten classes are 554, 571, 557, 572, 568, 558, 558, 566, 554, and 562.

The ionosphere data set (ion) consists of 351 radar readings returned from the ionosphere;
seventeen pulse numbers are extracted from each reading. The real part and the imaginary part of
the  complex  pulse  numbers  constitute  the  34  features  per  pattern.  There  are  two  classes:  “good” 
radar  returns  (225  patterns)  are  those  showing  evidence  of  some  type  of  structure  in  the  ionosphere, 
and  “bad”  returns  (126  patterns)  are  those  that  do  not;  their  signals  pass  through  the  ionosphere. 
PCA  is  applied  to  reduce  the  dimensionality  to  10. 

The  multi-feature  digit  data  set  consists  of  features  of  handwritten  numerals  extracted  from  a 
collection  of  Dutch  utility  maps.  Multiple  types  of  features  have  been  extracted.  We  have  only 
used  the  features  based  on  the  76  Fourier  coefficients  of  the  character  shapes.  The  resulting  data 
set is denoted by mfeat-fou. There are 200 patterns per digit class. PCA is applied to reduce the
dimensionality  to  16,  which  preserves  95%  of  the  total  energy. 

The  Wisconsin  breast  cancer  diagnostic  data  set  (wdbc)  has  two  classes:  benign  (357  cases)  and 
malignant  (212  cases).  The  30  features  are  computed  from  a  digitized  image  of  the  breast  tissue, 
which  describes  the  characteristics  of  the  cell  nuclei  present  in  the  image.  All  the  features  are 
standardized  to  have  mean  zero  and  variance  one.  PCA  is  applied  to  reduce  the  dimensionality  of 
the  data  to  14. 

The  UCI  image  segmentation  data  set  (UCI-seg)  contains  19  continuous  attributes  extracted 
from  random  3x3  regions  of  seven  outdoor  images.  One  of  the  features  has  zero  variation  and  is 
discarded.  The  training  and  testing  sets  are  combined  to  form  a  data  set  with  2310  patterns.  After 
standardizing  each  feature  to  have  variance  one,  PCA  is  applied  to  reduce  the  dimensionality  of 
the  data  to  10.  The  seven  classes  correspond  to  brick-face,  sky,  foliage,  cement,  window,  path,  and 
grass.  Each  of  the  classes  has  330  patterns. 


Data Sets from Statlog in UCI  The following five data sets are taken from the Statlog section⁷
in the UCI machine learning repository.

The  Australian  credit  approval  data  set  (austra)  has  690  instances  with  14  attributes.  The 
two  classes  are  of  size  383  and  307.  The  continuous  features  are  standardized  to  have  standard 
deviation  0.5.  Four  of  the  features  are  non-binary  nominal  features,  and  they  are  converted  to 
multiple  continuous  features.  PCA  is  then  applied  to  reduce  the  dimensionality  of  the  concatenated 
feature  vector  to  15. 

The  German  credit  data  (german)  contains  1000  records  with  24  features.  The  version  with 
numerical  attributes  is  used  in  our  experiments.  PCA  is  used  to  reduce  the  dimensionality  of  the 
data  to  18,  after  standardizing  the  features  so  that  all  of  them  lie  between  zero  and  one.  The  two 
classes  have  700  and  300  records. 

The  heart  data  set  (heart)  has  270  observations  with  13  raw  features  in  two  classes  with  150  and 
120  data  points.  The  three  nominal  features  are  converted  into  continuous  features.  The  continuous 
features  are  standardized  to  have  standard  deviation  0.5,  before  applying  PCA  to  reduce  the  data 
set  to  9  features. 


⁷ The URL is http://www.ics.uci.edu/~mlearn/databases/statlog/
The satellite image data set (sat) consists of the multi-spectral values of pixels in 3x3
neighborhoods in a satellite image. The aim is to predict the class associated with the central pixel,
which can be “red soil”, “cotton crop”, “grey soil”, “damp grey soil”, “soil with vegetation stubble” or
“very damp grey soil”. The training and the testing sets are combined to yield a data set of size
6435. There are 36 features altogether. The classes are of size 1533, 703, 1358, 626, 707 and 1508.

The  vehicle  silhouettes  data  set  (vehicle)  contains  a  set  of  features  extracted  from  the  silhouette 
of  a  vehicle.  The  goal  is  to  classify  a  vehicle  as  one  of  the  four  types  (Opel,  Saab,  bus,  or  van)  based 
on the silhouette. There are altogether 846 patterns in four classes of sizes
212, 217, 218, and 199. The features are first standardized to have standard deviation one, before
applying  PCA  to  reduce  the  dimensionality  to  16. 


Other Data Sets  We have also tested the proposed algorithm on data sets from other
sources.

The  texture  classification  data  set  (texture)  is  taken  from  [127].  It  consists  of  4000  patterns 
with  four  different  types  of  textures.  The  19  features  are  based  on  Gabor  filter  responses.  The  four 
classes  are  of  sizes  987,  999,  1027,  and  987. 

The online handwritten script data set (script), taken from [192], concerns the classification
of words and lines in an online handwritten document into one of six major scripts:
Arabic,  Cyrillic,  Devnagari,  Han,  Hebrew,  and  Roman.  Eleven  spatial  and  temporal  features  are 
extracted  from  the  strokes  of  the  words.  There  are  altogether  12938  patterns,  and  the  sizes  of  the 
six  classes  are  1190,  3173,  1773,  3539,  1002,  and  2261. 

The  ethnicity  recognition  data  set  (ethn)  was  originally  used  in  [175].  The  goal  is  to  classify 
a  64x64  face  image  into  two  classes:  “Asian”  (1320  images)  and  “non-Asian”  (1310  images).  It 
includes the PF01 database⁸, the Yale database⁹, the AR database [181], and the non-public NLPR
database¹⁰. Some example images are shown in Figure 5.5. Thirty eigenface coefficients are extracted
to represent the images.

The  clustering  under  constraints  algorithm  is  also  tested  on  an  image  segmentation  task  based  on 
the  Mondrian  image  shown  in  Figure  5.6,  which  has  five  distinct  segments.  The  image  is  divided  into 
101  by  101  sites.  Twelve  histogram  features  and  twelve  Gabor  filter  responses  of  four  orientations 
at  three  different  scales  are  extracted.  Because  the  histogram  features  always  sum  to  one,  PCA  is 
performed  to  reduce  the  dimension  from  24  to  23.  The  resulting  data  set  Mondrian  contains  10201 
patterns  with  23  features  in  5  classes.  The  sizes  of  the  classes  are  2181,  2284,  2145,  2323,  and  1268. 

The 3-newsgroup database¹¹ is about the classification of Usenet articles from different
newsgroups. It has been used previously to demonstrate the effectiveness of clustering under constraints
in [14]. It consists of three classification tasks (diff-300, sim-300, same-300), each of which
contains roughly 300 documents from three different topics. The topics are regarded as the classes to
be discovered. The three classification tasks are of different difficulties: the sets of three topics in
diff-300, sim-300, and same-300 have increasing similarities, in that order. Latent semantic
indexing is applied to the tf-idf normalized word features to convert each newsgroup article into a
feature vector of dimension 10. The three classes in diff-300 each contain 100 patterns, whereas the
three classes in sim-300 contain 96, 97, and 98 patterns. The sizes of the classes in same-300 are
99, 98, and 100.


⁸ http://nova.postech.ac.kr/archives/imdb.html.

⁹ http://cvc.yale.edu/projects/yalefaces/yalefaces.html.

¹⁰ Provided by Dr. Yunhong Wang, National Laboratory for Pattern Recognition, Beijing.

¹¹ It can be downloaded at http://www.cs.utexas.edu/users/ml/risc/.
Notice that the data sets ethn, Mondrian, diff-300, sim-300, and same-300 have been used in
the previous work [161]. The same preprocessing as in [161] is applied to both ethn and Mondrian,
though we reduce the dimensionality of the diff-300, sim-300, and same-300 data sets from 20 to 10
based on our “n > 3d²” rule.

5.6.2.2 Experimental Procedure

For each data set listed in Table 5.2, a constraint was specified by first generating a random point
pair $(y_i, y_j)$. If the ground-truth class labels of $y_i$ and $y_j$ were the same, a must-link
constraint was created between $y_i$ and $y_j$. Otherwise, a must-not-link constraint was created.
Different numbers of constraints were created as a percentage of the number of points in the data set:
1%, 2%, 3%, 5%, 10%, and 15%. Note that the constraints were generated in a “cumulative” manner: the
set of “1%” constraints was included in the set of “2%” constraints, and so on.
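The constraint-generation procedure can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the script used in the experiments: the function name `generate_constraints`, the use of two distinct indices per pair, the fact that repeated pairs are not filtered out, and the rounding of the constraint counts are all choices made for this example.

```python
import numpy as np

def generate_constraints(labels,
                         fractions=(0.01, 0.02, 0.03, 0.05, 0.10, 0.15),
                         seed=0):
    """Return {fraction: list of (i, j, is_must_link)} with nested sets.

    Pairs are drawn at random; a pair whose ground-truth labels agree
    becomes a must-link constraint, otherwise a must-not-link constraint.
    """
    labels = np.asarray(labels)
    n = len(labels)
    rng = np.random.default_rng(seed)
    max_count = int(round(max(fractions) * n))

    pairs = []
    while len(pairs) < max_count:
        # Assumption: the two points of a pair are distinct.
        i, j = rng.choice(n, size=2, replace=False)
        pairs.append((int(i), int(j), bool(labels[i] == labels[j])))

    # "Cumulative" generation: each smaller set is a prefix of the larger ones.
    return {f: pairs[: int(round(f * n))] for f in fractions}
```

Taking prefixes of one random sequence is what guarantees that the 1% set is contained in the 2% set, and so on.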

The  line-search  Newton  algorithm  was  used  to  optimize  the  objective  function  J  in  the  proposed 
approach.  The  Gaussians  were  represented  by  the  natural  parameters  i ' j  and  T,  with  a  common 
precision matrix among different Gaussian components. This particular choice of optimization
algorithm was made based on a preliminary efficiency study, where this approach was found to be the
most efficient among all the algorithms described in Section 5.4.1. Because the gradient is
available in line-search Newton, convergence was declared when the norm of the gradient was less than
a threshold fraction of the norm of the initial gradient. Note that this is a stricter and more
reasonable convergence criterion than the one typically used in the EM algorithm, which is based on
the relative change of log-likelihood. However, in order to safeguard against round-off error, we also declare
convergence  when  the  relative  change  of  the  objective  function  is  very  small:  lCW^®,  to  be  precise. 
Starting with a random initialization, line-search Newton was run with γ = 1 and τ = 0.25, with
the convergence threshold set to 10^(…). Line-search Newton was run again after increasing τ to 1,
with the convergence threshold tightened to 10^(…). Finally, τ and the convergence threshold were
set to 4 and 10^(…), respectively. The optimization algorithm was also stopped if convergence was
not  achieved  within  5000  Newton  iterations.  Fifteen  random  initializations  were  attempted.  The 
solution  with  the  best  objective  function  value  was  regarded  as  the  solution  found  by  the  proposed 
algorithm. 

The above procedure, however, assumes the constraint strength λ is known. The value of λ was
determined using a set of validation constraints. The training constraints and the validation
constraints were obtained using the following rules. Given a data set, if the number of
constraints was less than 3k, k being the number of clusters, all the available constraints were used
for both training and validation. This procedure, while risking overfitting, is necessary because too
small a set of constraints is poor both for training the clusters and for estimating λ. When the number
of constraints was between 3k and 6k, the numbers of training constraints and validation constraints
were both set to 3k, so the training constraints overlapped with the validation constraints. When
the number of constraints was larger than 6k, half the constraints were used for training and the
other half were used for validation. Starting with λ = 0.1, we increased λ by multiplying it by '/W.
For each λ, the proposed algorithm was executed. A better value of λ was encountered if the number
of violations of the validation constraints was smaller than the current best. If there was a tie, the
decision was made on the number of violations of the training constraints. If the best value of λ did
not change for four iterations, we assumed that the optimal value of λ was found. The proposed
algorithm was executed again using all the available constraints and the value of λ just determined. The
resulting solution was compared with the solution obtained using only the training constraints, and
the one with the smaller total number of constraint violations was regarded as our final clustering
           Full name                                   Source          n      d   F       NMI
derm       dermatology                                 UCI             366    11  0.9648  0.9258
digits     optical recognition of handwritten digits   UCI             5620   42  0.9516  0.8915
ion        ionosphere                                  UCI             351    10  0.8519  0.4003
mfeat-fou  multi-feature digit, Fourier coefficients   UCI             2000   16  0.7999  0.7369
UCI-seg    UCI Image segmentation                      UCI             2310   10  0.8445  0.7769
wdbc       Wisconsin breast cancer diagnostic          UCI             569    14  0.9645  0.8047
austra     Australian credit approval                  Statlog in UCI  690    15  0.8613  0.4384
german     German credit                               Statlog in UCI  1000   18  0.7627  0.1475
heart      heart                                       Statlog in UCI  270    9   0.8550  0.4010
sat        satellite image                             Statlog in UCI  6435   36  0.8382  0.7176
vehicle    vehicle silhouettes                         Statlog in UCI  846    16  0.7869  0.5850
script     online handwritten script                                   12938  11  0.7673  0.5812
texture    texture                                                     4000   19  0.9820  0.9274
ethn       ethnicity                                                   2630   30  0.9627  0.7704
Mondrian   Mondrian image segmentation                                 10201  23  0.9696  0.9042
diff-300   Usenet newsgroup (highly different)                         300    10  0.9432  0.7895
sim-300    Usenet newsgroup (somewhat similar)                         291    10  0.6996  0.3290
same-300   Usenet newsgroup (highly similar)                           297    10  0.7825  0.4071

Table 5.2: Summary of the real world data sets used in the experiments. The number of data points and the number of actual features used are
represented by n and d, respectively. The difficulty of the classification task associated with a data set can be seen from the F-score (denoted by F) and
the normalized mutual information (denoted by NMI). These two criteria are defined in Section 5.6.2.3. Higher values of F and NMI indicate that
the associated classification task is relatively easier; 0 < F ≤ 1 and 0 ≤ NMI ≤ 1.
(a) Result of the algorithm Shental    (b) Result of the algorithm Basu

Figure 5.4: The result of running different clustering under constraints algorithms for the synthetic
data set shown in Figure 5.3(a). While the algorithms Shental and Basu failed to discover the
desired clusters ((a) and (b)), the proposed algorithm succeeded with λ = 250 (c). The resulting
cluster boundaries and Gaussians are compared with those estimated with the ground-truth labels
((d) and (e)). When λ = 50, the proposed algorithm returned the natural clustering solution (f).
(a) Asians    (b) Non-Asians


Figure  5.5:  Example  face  images  in  the  ethnicity  classification  problem  for  the  data  set  ethn. 


Figure  5.6:  The  Mondrian  image  used  for  the  data  set  Mondrian.  It  contains  5  segments.  Three  of 
the  segments  are  best  distinguished  by  Gabor  filter  responses,  whereas  the  remaining  two  are  best 
distinguished  by  their  gray-level  histograms. 


solution.  If  there  was  a  tie,  the  solution  obtained  with  training  constraints  only  was  selected. 
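The validation procedure for the constraint strength can be sketched as below. This is a hedged illustration, not the original implementation: `fit(constraints, lam)` and `count_violations(solution, constraints)` are hypothetical callables standing in for the proposed algorithm and the violation counter, the way the 3k training and validation constraints are picked is an assumption, and the default multiplier `factor=2.0` is a placeholder rather than the value used in the experiments.

```python
def split_constraints(constraints, k):
    """Split constraints into training and validation sets by the 3k/6k rule."""
    m = len(constraints)
    if m < 3 * k:                        # too few: reuse everything for both
        return list(constraints), list(constraints)
    if m <= 6 * k:                       # 3k each, so the two sets may overlap
        return list(constraints[:3 * k]), list(constraints[-3 * k:])
    half = m // 2                        # plenty: disjoint halves
    return list(constraints[:half]), list(constraints[half:])

def select_lambda(fit, count_violations, train, valid,
                  lam0=0.1, factor=2.0, patience=4, max_steps=100):
    """Increase lam geometrically until the best value stops changing."""
    best = None                          # ((valid_viol, train_viol), lam, solution)
    lam, stale = lam0, 0
    for _ in range(max_steps):
        solution = fit(train, lam)
        # Validation violations decide; training violations break ties.
        key = (count_violations(solution, valid),
               count_violations(solution, train))
        if best is None or key < best[0]:
            best, stale = (key, lam, solution), 0
        else:
            stale += 1
            if stale >= patience:        # best lam unchanged for `patience` steps
                break
        lam *= factor
    return best[1], best[2]

def final_solution(fit, count_violations, train, valid, lam, train_solution):
    """Refit on all constraints; keep the training-only solution on ties."""
    all_constraints = list(train) + [c for c in valid if c not in train]
    full_solution = fit(all_constraints, lam)
    v_full = count_violations(full_solution, all_constraints)
    v_train = count_violations(train_solution, all_constraints)
    return full_solution if v_full < v_train else train_solution
```

The lexicographic comparison of `(validation violations, training violations)` is one compact way to express the tie-breaking rule described in the text.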

The algorithms Shental and Basu were run using the same set of data and constraints as input.
For Shental, we modified the initialization strategy in their software, which involved a two-step
process. First, five random parameter vectors were generated, and the one with the highest
log-likelihood was selected as the initial value of the EM algorithm. Convergence was achieved if
the relative change in the log-likelihood was less than a threshold, which is 10^(…) by default. This
process was repeated 15 times, and the parameter vector with the highest log-likelihood was regarded
as the solution. For easier comparison, we also assumed a common covariance matrix among the
different Gaussian components. For the algorithm Basu, the authors provided their own initialization
strategy, which was based on the set of constraints provided. The algorithm was run 15 times, and
the solution with the best objective function was picked. The algorithm Basu requires a constraint
penalty parameter. In our experiment, a wide range of values was tried: 1, 2, 4, 8, 16, 32, 64, 128,
256, 500, 1000, 2000, 4000, 8000, 16000. We report only the results obtained with the best penalty
value. As a result, the performance of Basu reported here might be inflated.


5.6.2.3 Performance Criteria

A clustering under constraints algorithm is said to perform well on a data set if the clusters obtained
are similar to the ground-truth classes. Consider the k by k “contingency matrix” $\{c_{ij}\}$, where $c_{ij}$
denotes the number of data points that are originally from the i-th class and are assigned to the
j-th cluster. If the clusters match the true classes perfectly, there should only be one non-zero entry
in each row and each column of the contingency matrix.
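As a small illustration, the contingency matrix can be built directly from two label vectors. The helper name `contingency_matrix` and the convention that classes and clusters are both coded as integers 0, ..., k − 1 are assumptions of this sketch.

```python
import numpy as np

def contingency_matrix(classes, clusters, k):
    """C[i, j] = number of points from class i that were assigned to cluster j."""
    C = np.zeros((k, k), dtype=int)
    for i, j in zip(classes, clusters):
        C[i, j] += 1
    return C
```

A perfect clustering yields a matrix that is a row or column permutation of a diagonal matrix, which is the "one non-zero entry per row and column" condition stated above.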


Following the common practice in the literature, we summarize the contingency matrix by the
F-score and the normalized mutual information (NMI). Consider the “recall matrix” $\{r_{ij}\}$ in which
the entries are defined by $r_{ij} = c_{ij} / \sum_{j'} c_{ij'}$. Intuitively, $r_{ij}$ denotes the proportion of the
i-th class that is “recalled” by the j-th cluster. The “precision matrix” $\{p_{ij}\}$, on the other hand, is
defined by $p_{ij} = c_{ij} / \sum_{i'} c_{i'j}$. It represents how “pure” the j-th cluster is with respect to the i-th class.
Entries in the F-score matrix $\{f_{ij}\}$ are simply the harmonic mean of the corresponding entries in
the precision and recall matrices, i.e., $f_{ij} = 2 r_{ij} p_{ij} / (r_{ij} + p_{ij})$. The F-score of the i-th class, $F_i$, is
obtained by assuming that the i-th class matches with the best cluster¹², i.e., $F_i = \max_j f_{ij}$. The
overall F-score is computed as the weighted sum of the individual $F_i$ according to the sizes of the
true classes, i.e.,

\[
\text{F-score} = \sum_{i=1}^{k} \frac{\sum_{j=1}^{k} c_{ij}}{n} \, F_i . \tag{5.34}
\]

Note that the precision of an empty cluster is undefined. This problem can be circumvented if we
restrict that empty clusters, if any, should not contribute to the overall F-score.
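The F-score computation can be sketched as follows. The function name `f_score` is an assumption of this example, as is the requirement that every true class is non-empty; empty clusters are dropped before precision is computed, in line with the remark above.

```python
import numpy as np

def f_score(contingency):
    """Overall F-score of a clustering from its k-by-k contingency matrix."""
    C = np.asarray(contingency, dtype=float)
    n = C.sum()
    class_sizes = C.sum(axis=1)             # assumed to be all non-zero
    cluster_sizes = C.sum(axis=0)

    nonempty = cluster_sizes > 0            # empty clusters do not contribute
    C, cluster_sizes = C[:, nonempty], cluster_sizes[nonempty]

    recall = C / class_sizes[:, None]       # r_ij
    precision = C / cluster_sizes[None, :]  # p_ij
    denom = recall + precision
    f = np.divide(2 * recall * precision, denom,
                  out=np.zeros_like(C), where=denom > 0)

    # Each class is matched to its best cluster, then weighted by class size.
    return float((class_sizes / n) @ f.max(axis=1))
```

For two equal-sized classes split evenly over two clusters, every entry of the F-score matrix is 0.5, so the function returns 0.5.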

¹² Here, we do not require that one cluster can only match to one class. If this one-to-one correspondence is desired,
the Hungarian algorithm should be used to perform the matching instead of the max operation to compute $F_i$.

The computation of normalized mutual information interprets the true class label and the cluster
label as two random variables U and V. The contingency table, after dividing by n (the number of
objects), forms the joint distribution of U and V. The mutual information (MI) between U and V
can be computed based on the joint distribution. Since the range of the mutual information depends
on the sizes of the true classes and the sizes of the clusters, we normalize the MI by the average of
the entropies of U and V (denoted by H(U) and H(V)) so that the resulting value lies between zero
and one. Formally, we have

\[
\begin{aligned}
H(U) &= -\sum_{i=1}^{k} \frac{\sum_{j=1}^{k} c_{ij}}{n} \log \frac{\sum_{j=1}^{k} c_{ij}}{n}, \\
H(V) &= -\sum_{j=1}^{k} \frac{\sum_{i=1}^{k} c_{ij}}{n} \log \frac{\sum_{i=1}^{k} c_{ij}}{n}, \\
H(U, V) &= -\sum_{i=1}^{k} \sum_{j=1}^{k} \frac{c_{ij}}{n} \log \frac{c_{ij}}{n}, \\
\mathrm{MI} &= H(U) + H(V) - H(U, V), \\
\mathrm{NMI} &= \frac{\mathrm{MI}}{\left( H(U) + H(V) \right) / 2}.
\end{aligned}
\tag{5.35}
\]


For  both  F-score  and  NMI,  the  higher  the  value,  the  better  the  match  between  the  clusters 
and  the  true  classes.  For  a  perfect  match,  both  NMI  and  F-score  take  the  value  of  1.  When  the 
cluster  labels  are  completely  independent  of  the  class  labels,  NMI  takes  its  smallest  value  of  0.  The 
minimum  value  of  F-score  depends  on  the  sizes  of  the  true  classes.  If  all  the  classes  are  of  equal 
sizes,  the  lower  bound  of  F-score  is  1/k.  In  general,  the  lower  bound  of  F-score  is  higher,  and  it  can 
be  more  than  0.5  if  there  is  a  dominant  class. 
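A minimal sketch of the NMI computation is given below. The function names `entropy` and `nmi` are assumptions of this example, zero-probability cells are skipped using the usual convention 0 log 0 = 0, and returning 0 when both entropies vanish is an arbitrary choice made here for the degenerate single-class, single-cluster case.

```python
import numpy as np

def entropy(p):
    """Shannon entropy of a probability array, with 0 * log(0) taken as 0."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def nmi(contingency):
    """Mutual information normalized by the average of H(U) and H(V)."""
    P = np.asarray(contingency, dtype=float)
    P = P / P.sum()                  # joint distribution of (U, V)
    h_u = entropy(P.sum(axis=1))     # entropy of the class labels
    h_v = entropy(P.sum(axis=0))     # entropy of the cluster labels
    h_uv = entropy(P)                # joint entropy
    mi = h_u + h_v - h_uv
    denom = 0.5 * (h_u + h_v)
    return mi / denom if denom > 0 else 0.0
```

A diagonal contingency matrix gives a value of 1, and a matrix whose rows are all proportional (cluster labels independent of class labels) gives 0 up to rounding error.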


5.6.2.4 Results

The results of clustering the data sets mentioned in Section 5.6.2.1 when there are no constraints
are shown in Table 5.3. In the absence of constraints, both the proposed algorithm and Shental
effectively find the cluster parameter vector that maximizes the log-likelihood, whereas Basu is the
same as the k-means algorithm. One may be surprised to discover from Table 5.3 that even though
the proposed algorithm and Shental optimize the same objective function, their results are different.
This is understandable when we notice that the line-search Newton algorithm used by the proposed
approach and the EM algorithm used by Shental can locate different local optima. It is sometimes
argued that maximizing the mixture log-likelihood globally is inappropriate as it can go to infinity
when one of the Gaussian components has an almost singular covariance matrix. However, this is not
the case here, because the covariance matrices all have small condition numbers as seen in Table 5.3.
Therefore, among the two solutions produced by the proposed approach and by Shental, we take
the one with the larger log-likelihood. In the remaining experiments, the no-constraint solutions
found by the proposed algorithm were also used as the initial value for Shental. This is because we
are interested in locating the best possible local optima for the objective functions.

The results of running our proposed algorithm, Shental, and Basu at constraint levels of 1%, 2%,
3%, 5%, 10%, and 15% are shown in Tables 5.4, 5.5, 5.6, 5.7, 5.8, and 5.9, respectively. In these
tables, the columns under “Proposed” correspond to the performance of the proposed algorithm. The
heading λ denotes the value of the constraint strength as determined by the validation procedure. The
heading “Shental, default init” corresponds to the performance when the algorithm Shental is
initialized by its default strategy, whereas “Shental, special init” corresponds to the result when
Shental is initialized by the no-constraint solution found by the proposed approach. The heading
“log-lik” shows the log-likelihood of the resulting parameter vector. Among these two solutions of
Shental, the one with a higher log-likelihood is selected, and its performance is shown under the
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Table  5.3:  Performance  of  different  clustering  algorithms  in  the  absence  of  constraints.  Both  the  proposed  algorithm  and  Shental  maximize  the 
log-likelihood  in  this  case,  though  the  former  uses  the  line-search  Newton  whereas  the  latter  uses  the  EM  algorithm.  The  headings  n,  F,  NMI,  log-lik 
and  K  denote  the  number  of  data  points,  the  F-score,  the  normalized  mutual  information,  the  log-likelihood,  and  the  condition  number  of  the  common 
covariance  matrix,  respectively. 


heading  “Shental,  combine”. 
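
The constraint counts m listed in Tables 5.4 to 5.9 are consistent with taking the stated percentage of the number of data points n and rounding to the nearest integer, with halves rounded up. This rounding rule is inferred from the tabulated values and is not stated explicitly in the text, so the sketch below should be read as an assumption; the function name is hypothetical.

```python
def num_constraints(n: int, level_percent: int) -> int:
    """Number of constraints at a given constraint level.

    Assumed rule: m = round-half-up(n * level_percent / 100), computed with
    integer arithmetic to avoid floating-point rounding surprises.
    """
    if n < 0 or level_percent < 0:
        raise ValueError("n and level_percent must be non-negative")
    return (n * level_percent + 50) // 100


# Values that agree with the m column of the tables:
m_derm_2 = num_constraints(366, 2)      # derm at the 2% level
m_austra_5 = num_constraints(690, 5)    # austra at the 5% level
m_heart_15 = num_constraints(270, 15)   # heart at the 15% level
```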

From these tables, we can see that Shental with default initialization often yields higher
performance than Shental with the special initialization. However, the log-likelihood of Shental
with default initialization is sometimes smaller. By the principle of maximum likelihood, such a
solution, though it has a higher F-score and/or NMI, should not be accepted. This observation
implies that the good performance of Shental reported in comparative work such as
[161] might be due to the initialization strategy rather than the model used. We run Shental with
the special initialization because we are more interested in comparing the model used in Shental
with that used in the proposed approach than in comparing initialization strategies.
We also tried to do something similar with Basu, but its initialization routine is integrated
with the main clustering routine, so it is non-trivial to modify the initialization strategy.

The numbers listed in Tables 5.3 to 5.9 are visualized in Figures 5.7 to 5.13. For each data
set, we plot the F-score and the NMI against an increasing number of constraints. The horizontal
axis corresponds to the constraint level, expressed as a percentage of the number of data
points, whereas the vertical axis corresponds to the F-score or the NMI. The results of the proposed
algorithm, Shental, and Basu are shown by the (red) solid lines, (blue) dotted lines, and (black)
dashed lines, respectively. For comparison, the (gray) dash-dot lines in the figures show the F-score
and the NMI of a classifier trained using the labels of all the objects in the data set under
the assumption that the class conditional densities are Gaussian with a common covariance matrix.
The data sets are grouped according to the performance of the proposed algorithm. The proposed
algorithm outperformed both Shental and Basu for the data sets shown in Figures 5.7 to 5.9. The
performance of the proposed algorithm is comparable to that of its competitors for the data sets shown in
Figures 5.10 to 5.12. For the data sets shown in Figure 5.13, the proposed algorithm is slightly
inferior to one of its competitors. We shall examine the performance on individual data sets later.
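
For readers who wish to recompute the NMI values plotted in these figures, a minimal sketch is given below. It assumes the normalization I(A;B) / sqrt(H(A) H(B)); several normalizations of mutual information are in use, and the one adopted in this thesis is defined elsewhere and may differ. The function names are illustrative.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def entropy(labels: Sequence[Hashable]) -> float:
    """Shannon entropy (natural log) of a labeling."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Mutual information between two labelings of the same objects."""
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    joint = Counter(zip(a, b))
    return sum(
        (c / n) * math.log(c * n / (count_a[x] * count_b[y]))
        for (x, y), c in joint.items()
    )


def nmi(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """NMI with geometric-mean normalization (an assumed convention)."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("labelings must be non-empty and of equal length")
    h_a, h_b = entropy(a), entropy(b)
    if h_a == 0.0 or h_b == 0.0:
        return 0.0  # a single-cluster labeling carries no information
    return mutual_information(a, b) / math.sqrt(h_a * h_b)
```

Under this convention, two labelings that agree up to a renaming of the clusters score 1, and statistically independent labelings score 0.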

Perhaps the first observation from these figures is that the performance is not monotonic, i.e., the
F-score and the NMI can actually decrease when there are additional constraints. This is counter-intuitive,
because one expects improved results when more information (in the form of constraints)
is fed as input to the algorithms. Note that this lack of monotonicity is observed for all
three algorithms. There are three reasons for this. First, the additional constraints can be based
on data points that are erroneously labeled (errors in the ground truth), or that are “outliers” in
the sense that they would be misclassified by most reasonable supervised classifiers trained with
all the labels known. The additional constraints in this case serve as “misinformation”, and they can
hurt the performance of the algorithms for clustering under constraints. This effect is more severe for
the proposed approach when there are only a small number of constraints, because the influence
of each constraint may be magnified by a large value of λ. The second reason is that an
algorithm may locate a poor local optimum. In general, the larger the number of constraints, the
greater the number of local optima in the energy landscape. So, the proposed algorithm, as well as
Shental and Basu, is more likely to get trapped in poor local optima. This trend is the most obvious
for Basu, as the performance at the 10% and 15% constraint levels dropped for more than half of the
data sets. This is not surprising, because the iterated conditional modes algorithm used by Basu is greedy and
is likely to get trapped in local optima. The third reason is specific to the proposed approach. It
is due to the random nature of the partitioning of the constraints into a training set and a validation
set. If we have an unfavorable split, the value of λ found by minimizing the number of violations
on the set of validation constraints can be suboptimal. In fact, we observe that whenever there is a
significant drop in the F-score and NMI, there often exists a better value of λ than the one found
by the validation procedure.
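
The validation procedure referred to above can be sketched as follows: the constraints are randomly split into a training part and a validation part, the model is fitted on the training constraints for each candidate constraint strength, and the value with the fewest violated validation constraints is kept. The callback `fit`, the representation of constraints as index pairs, and all function names are assumptions made for illustration; the actual optimizer of Section 5.5 is not reproduced here.

```python
import random
from typing import Callable, List, Sequence, Tuple

Pair = Tuple[int, int]
ConstraintSet = Tuple[List[Pair], List[Pair]]  # (must-link, must-not-link)


def count_violations(labels: Sequence[int],
                     must_link: Sequence[Pair],
                     must_not_link: Sequence[Pair]) -> int:
    """Count constraints that a hard cluster labeling fails to satisfy."""
    broken_ml = sum(1 for i, j in must_link if labels[i] != labels[j])
    broken_mnl = sum(1 for i, j in must_not_link if labels[i] == labels[j])
    return broken_ml + broken_mnl


def split_constraints(constraints: Sequence[Pair],
                      val_fraction: float = 0.5,
                      seed: int = 0) -> Tuple[List[Pair], List[Pair]]:
    """Randomly split constraints into (training, validation) lists."""
    shuffled = list(constraints)
    random.Random(seed).shuffle(shuffled)
    n_val = int(round(val_fraction * len(shuffled)))
    return shuffled[n_val:], shuffled[:n_val]


def select_lambda(candidates: Sequence[float],
                  fit: Callable[[float, ConstraintSet], Sequence[int]],
                  train: ConstraintSet,
                  val: ConstraintSet) -> Tuple[float, int]:
    """Pick the candidate with the fewest validation violations.

    `fit(lam, train)` is a hypothetical callback that clusters the data with
    constraint strength `lam` and returns one cluster label per data point.
    Ties are broken in favor of the earliest candidate.
    """
    best_lam, best_viol = None, None
    for lam in candidates:
        labels = fit(lam, train)
        viol = count_violations(labels, val[0], val[1])
        if best_viol is None or viol < best_viol:
            best_lam, best_viol = lam, viol
    return best_lam, best_viol
```

Because the split is random, a different seed can lead to a different selected value, which is the source of variability discussed in the third reason above.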

Performance on Individual Data Sets. The result on the ethn data set can be seen in Figures
5.7(a) and 5.7(b). The performance of the proposed algorithm improves with additional constraints,
and it outperforms Shental and Basu at all constraint levels. A similar phenomenon occurs
for the Mondrian data set (Figures 5.7(c) and 5.7(d)) and the ion data set (Figures 5.7(e) and 5.7(f)).
For Mondrian, note that a 1% constraint level is already sufficient to bias the cluster parameter to
match the result obtained using the ground-truth labels. Additional constraints help only marginally. The
performance of the proposed algorithm for the script data set (Figures 5.8(a) and 5.8(b)) is better
than that of Shental and Basu for all constraint levels except 1%, where the proposed algorithm is inferior
to Basu. However, given how much better the k-means algorithm is when compared
with the EM algorithm in the absence of constraints, it is fair to say that the proposed algorithm is
doing a decent job. For the data set derm, the clustering solution without any constraints is quite
good: that solution, in fact, satisfies all the constraints when the constraint levels are 1% and 2%.
Therefore, it is natural that the performance does not improve with the provision of the constraints.
However, when the constraint level is higher than 2%, the proposed algorithm again outperforms
Shental and Basu (Figures 5.8(c) and 5.8(d)). The performance of the proposed algorithm on the
vehicle data set is superior to that of Shental and Basu for all constraint levels except 5%, where the
performance of Shental is slightly superior. For the data set wdbc, the performance of the proposed
algorithm (Figures 5.9(a) and 5.9(b)) is better than that of Shental at all constraint levels except 5%. The
proposed algorithm outperforms Basu when the constraint level is higher than 1%.

The F-score of the proposed algorithm on the UCI-seg data set (Figure 5.10(a)) is superior to
that of Shental at three constraint levels and is superior to that of Basu at all but the 1% constraint level. On the
other hand, if NMI is used (Figure 5.10(b)), the proposed algorithm does not do as well as the
others. For the heart data set, the proposed algorithm is superior to Shental at all constraint
levels, but it is superior to Basu only at the 3% constraint level (Figures 5.10(c) and 5.10(d)). Note
that the performance of Basu might be inflated because we only report its best results among all
possible values of the constraint penalty in this algorithm. We can regard the performance of the
proposed algorithm on the austra data set (Figures 5.10(e) and 5.10(f)) as a tie with Shental and
Basu, because the proposed algorithm outperforms Shental and Basu at three out of six possible
constraint levels. For the german data set, the proposed algorithm performs the best in terms of NMI
(Figure 5.11(b)), though none of the three algorithms performs well. Apparently,
this is a difficult data set. The performance of the proposed algorithm is less impressive when the F-score
is used, however (Figure 5.11(a)). The proposed algorithm is superior to Shental in performance
for the sim-300 data set (Figures 5.11(c) and 5.11(d)). While the proposed algorithm ties
with Basu based on the F-score, Basu outperforms the proposed
algorithm on this data set when NMI is used. The result on the diff-300 data set (Figures 5.11(e)
and 5.11(f)) is somewhat similar: the proposed algorithm outperforms Shental at all constraint
levels, but it is inferior to Basu. Given the fact that the k-means algorithm is much better than
EM in the absence of constraints for this data set, the proposed algorithm is not as bad as it first
seems. For the sat data set (Figures 5.12(a) and 5.12(b)), the proposed algorithm outperforms
Shental and Basu significantly in terms of F-score when the constraint levels are 10% and 15%.
The improvement in NMI is less significant, though the proposed method is still the best at three
constraint levels. The result on the digits data set (Figures 5.12(c) and 5.12(d)) is similar: the
proposed method is superior to its competitors at three and four constraint levels if F-score and
NMI are used as the evaluation criteria, respectively.

It is difficult to draw any conclusion on the performance of the three algorithms on the mfeat-fou
data set (Figures 5.13(a) and 5.13(b)). The performances of all three algorithms go up and down
with an increasing number of constraints. Apparently, this data set is fairly noisy, and clustering with
constraints is not appropriate for this data set. For the data set same-300, the proposed algorithm
does not perform well: it ties with Shental, but it is inferior to Basu at all constraint levels,
as seen in Figures 5.13(c) and 5.13(d). The performance of the proposed algorithm is better than that of
Shental only at the 15% constraint level for the data set texture (Figures 5.13(e) and 5.13(f)).
The proposed algorithm is superior to Basu for this data set, though this is probably due to the
better performance of the EM algorithm in the absence of constraints. Note that this is
a relatively easy data set for model-based clustering: both k-means and EM have an F-score higher
than 0.95 when no constraints are used.

5.6.3  Experiments  on  Feature  Extraction 

We have also tested the idea of learning the low-dimensional subspace and the clusters simultaneously
in the presence of constraints. Our first experiment in this regard is based on the data
set shown in Figure 5.3. The two features were standardized to unit variance before applying the
algorithm described in Section 5.5 with the two must-link constraints. Based on the result shown in
Figure 5.14(a), we can see that a good projection direction was found by the proposed algorithm.
The projected data follow the Gaussian distribution well, as is evident from Figure 5.14(b).

Our second experiment is about the combination of feature extraction and the kernel trick to
detect clusters with general shapes. The two-ring data set (Figure 5.15(a)) considered in [158], which
used a hidden Markov random field approach for clustering with constraints in kernel k-means, was
used. As in [158], we applied the RBF kernel to transform this data set of 200 points nonlinearly.
The kernel width was set to 0.2, which was the 20th percentile of all the pairwise distances. Unlike
[158], we applied kernel PCA to this data set and extracted 20 features. The algorithm described in
Section 5.5 was used to learn a good projection of these 20 features into a 2D space while clustering
the data into two groups simultaneously in the presence of 60 randomly generated constraints. The
result shown in Figure 5.15(b) indicates that the algorithm successfully found a 2D subspace such
that the two clusters were Gaussian-like, and all the constraints were satisfied. When we plot the
cluster labels on the original two-ring data set, we can see that the desired clusters (the “inner”
and the “outer” rings) were recovered perfectly (Figure 5.15(c)). Note that the algorithm described
in [158] required at least 450 constraints to identify the two clusters perfectly, whereas we have
used only 60 constraints. For comparison, the spectral clustering algorithm in [194] was applied to
this data set using the same kernel matrix as the similarity. The two desired clusters could not be
recovered (Figure 5.15(d)). In fact, the two desired clusters were never recovered even when we tried
other values of the kernel width.
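
The preprocessing used in this experiment (RBF kernel with the width set to the 20th percentile of the pairwise distances, followed by kernel PCA with 20 features) can be sketched with NumPy as below. This is only an illustration under stated assumptions: the two-ring data are synthesized here and are not the data set of [158], the kernel is assumed to be exp(-d^2 / (2 * width^2)), and the subsequent constrained clustering step of Section 5.5 is not included.

```python
import numpy as np


def two_rings(n_per_ring=100, radii=(1.0, 2.0), noise=0.05, seed=0):
    """Sample a toy two-ring data set (illustrative, not the data of [158])."""
    rng = np.random.default_rng(seed)
    rings = []
    for r in radii:
        theta = rng.uniform(0.0, 2.0 * np.pi, n_per_ring)
        rad = r + noise * rng.standard_normal(n_per_ring)
        rings.append(np.column_stack((rad * np.cos(theta), rad * np.sin(theta))))
    return np.vstack(rings)


def pairwise_distances(X):
    """Euclidean distance matrix between the rows of X."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    return np.sqrt(np.maximum(d2, 0.0))  # clip tiny negatives from round-off


def rbf_kernel(D, width):
    """RBF kernel from distances; the exact parametrization is an assumption."""
    return np.exp(-(D ** 2) / (2.0 * width ** 2))


def kernel_pca(K, n_components):
    """Extract kernel PCA features from a kernel matrix."""
    n = K.shape[0]
    J = np.eye(n) - np.full((n, n), 1.0 / n)   # centering matrix
    Kc = J @ K @ J                             # kernel centered in feature space
    Kc = 0.5 * (Kc + Kc.T)                     # enforce exact symmetry
    eigvals, eigvecs = np.linalg.eigh(Kc)
    order = np.argsort(eigvals)[::-1][:n_components]
    top = np.clip(eigvals[order], 0.0, None)
    return eigvecs[:, order] * np.sqrt(top)    # one row of features per point


X = two_rings()                                   # 200 points in 2D
D = pairwise_distances(X)
upper = D[np.triu_indices(len(X), k=1)]           # each pair counted once
width = np.percentile(upper, 20)                  # 20th percentile of distances
K = rbf_kernel(D, width)
Z = kernel_pca(K, n_components=20)                # 200 x 20 feature matrix
```

The rows of `Z` would then play the role of the 20 extracted features that are projected to 2D while clustering under the constraints.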


5.7  Discussion 

5.7.1  Time  Complexity 

The computation of the objective function and its gradient requires the calculation of $r_{ij}$, $s_{ij}$, $w_{ij}$,
and the weighted sum of different sufficient statistics with $r_{ij}$ and $w_{ij}$ as weights. When compared
with the EM algorithm for standard model-based clustering, the extra computation by the proposed
algorithm is due to $s_{ij}$, $w_{ij}$, and the accumulation of the corresponding sufficient statistics. These
take $O(kd(m^+ + m^- + n^*))$ time, where $k$, $d$, $m^+$, $m^-$, and $n^*$ denote the number of clusters, the
dimension of the feature vector, the number of must-link constraints, the number of must-not-link
constraints, and the number of data points involved in any constraint, respectively. This is smaller
than the $O(kdn)$ time required for one iteration of the EM algorithm, with $n$ indicating the total
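| Data set | n | m | Proposed F | Proposed NMI | Proposed λ | Shental default F | Shental default NMI | Shental default log-lik | Shental special F | Shental special NMI | Shental special log-lik | Shental combined F | Shental combined NMI | Basu F | Basu NMI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|  |  |  |  |  | 3.2 x 10^2 | 0.496 | 0.042 | 3.301 x 10^3 | 0.493 | 0.023 | 3.462 x 10^3 | 0.493 | 0.023 | 0.507 | 0.108 |
| same-300 | 297 | 3 | 0.580 | 0.164 | 3.2 x 10^2 | 0.585 | 0.179 | 3.135 x 10^3 | 0.489 | 0.052 | 3.151 x 10^3 | 0.489 | 0.052 | 0.594 | 0.183 |

Table 5.4: Performance of algorithms for clustering under constraints when the constraint level is 1%. The headings n, m, F, NMI, λ, and log-lik denote
the number of data points, the number of constraints, the F-score, the normalized mutual information, the optimal λ for the proposed algorithm
found by the validation procedure, and the log-likelihood, respectively.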


| Data set | n | m | Proposed F | Proposed NMI | Proposed λ | Shental default F | Shental default NMI | Shental default log-lik | Shental special F | Shental special NMI | Shental special log-lik | Shental combined F | Shental combined NMI | Basu F | Basu NMI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| derm | 366 | 7 | 0.817 | 0.868 | 0 | 0.817 | 0.868 |  | 0.817 | 0.868 |  | 0.817 | 0.868 | 0.838 | 0.880 |
| digits | 5620 | 112 | 0.747 | 0.710 |  | 0.756 | 0.745 | -6.153 x 10^3 | 0.747 | 0.721 | -6.155 x 10^3 | 0.756 | 0.745 | 0.815 | 0.760 |
| ion | 351 | 7 | 0.755 | 0.175 |  | 0.651 | 0.046 | [illegible] | 0.651 | 0.046 |  | 0.651 | 0.046 | 0.721 | 0.140 |
| mfeat-fou | 2000 | 40 | 0.737 | 0.678 |  | 0.701 | 0.676 |  | 0.756 | 0.705 |  | 0.701 | 0.676 | 0.752 | 0.700 |
| UCI-seg | 2310 | 46 | 0.721 | 0.709 | [illegible] | 0.702 | 0.667 |  | 0.711 | 0.683 |  | 0.711 | 0.683 | 0.623 | 0.577 |
| wdbc | 569 | 11 | 0.913 | 0.636 |  | 0.671 | 0.151 |  | 0.681 | 0.002 |  | 0.671 | 0.151 | 0.909 | 0.556 |
| austra | 690 | 14 | 0.616 | 0.058 |  | 0.523 | 0.001 |  | 0.601 | 0.025 |  | 0.523 | 0.001 | 0.586 | 0.021 |
| german | 1000 | 20 | 0.631 | 0.011 |  | 0.651 | 0.001 |  | 0.648 | 0.008 |  | 0.648 | 0.008 | 0.579 | 0.005 |
| heart | 270 | 5 | 0.762 | 0.205 |  | 0.594 | 0.026 |  | 0.594 | 0.026 |  | 0.594 | 0.026 | 0.830 | 0.339 |
| sat | 6435 | 129 | 0.719 | 0.637 | 3.2 x 10^1 | 0.717 | 0.634 | -6.627 x 10^3 | 0.715 | 0.630 | -6.627 x 10^3 | 0.715 | 0.630 | 0.713 | 0.613 |
| vehicle | 846 | 17 | 0.737 | 0.562 | 1.0 x 10^3 | 0.485 | 0.183 | -7.137 x 10^3 | 0.517 | 0.261 | -7.079 x 10^3 | 0.517 | 0.261 | 0.418 | 0.114 |
| script | 12938 | 259 | 0.735 | 0.576 | 1.0 x 10^3 | 0.671 | 0.518 | -1.617 x 10^3 | 0.670 | 0.518 | -1.617 x 10^3 | 0.670 | 0.518 | 0.654 | 0.536 |
| texture | 4000 | 80 | 0.976 | 0.913 | 3.2 x 10^1 | 0.978 | 0.918 | -6.482 x 10^4 | 0.978 | 0.918 | -6.482 x 10^4 | 0.978 | 0.918 | 0.958 | 0.865 |
| ethn | 2630 | 53 | 0.924 | 0.613 | 3.2 x 10^2 | 0.645 | 0.000 | 1.498 x 10^5 | 0.648 | 0.006 | 1.500 x 10^5 | 0.648 | 0.006 | 0.572 | 0.001 |
| Mondrian | 10201 | 204 | 0.966 | 0.898 | 3.2 x 10^2 | 0.809 | 0.797 | 4.135 x 10^4 | 0.809 | 0.797 | 4.135 x 10^4 | 0.809 | 0.797 | 0.766 | 0.785 |
| diff-300 | 300 | 6 | 0.616 | 0.199 | 3.2 x 10^2 | 0.480 | 0.091 | 3.157 x 10^3 | 0.492 | 0.070 | 3.177 x 10^3 | 0.492 | 0.070 | 0.741 | 0.494 |
| sim-300 | 291 | 6 | 0.532 | 0.071 | 3.2 x 10^2 | 0.495 | 0.029 | 3.468 x 10^3 | 0.493 | 0.017 | 3.137 x 10^3 | 0.495 | 0.029 | 0.515 | 0.137 |
| same-300 | 297 | 6 | 0.446 | 0.025 | 1.0 x 10^2 | 0.586 | 0.182 | 3.091 x 10^3 | 0.493 | 0.026 | 3.109 x 10^3 | 0.493 | 0.026 | 0.598 | 0.188 |

Table 5.5: Performance of algorithms for clustering under constraints when the constraint level is 2%. The headings n, m, F, NMI, λ, and log-lik denote
the number of data points, the number of constraints, the F-score, the normalized mutual information, the optimal λ for the proposed algorithm
found by the validation procedure, and the log-likelihood, respectively.


| Data set | n | m | Proposed F | Proposed NMI | Proposed λ | Shental default F | Shental default NMI | Shental default log-lik | Shental special F | Shental special NMI | Shental special log-lik | Shental combined F | Shental combined NMI | Basu F | Basu NMI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| derm | 366 | 11 | 0.951 | 0.914 |  | 0.813 | 0.868 | [illegible] | 0.817 | 0.868 | [illegible] | 0.817 | 0.868 | 0.837 | 0.874 |
| digits | 5620 | 169 | 0.827 | 0.790 |  | 0.758 | 0.746 | -6.153 x 10^5 | 0.754 | 0.744 | -6.153 x 10^5 | 0.754 | 0.744 | 0.815 | 0.755 |
| ion | 351 | 11 | 0.724 | 0.130 |  | 0.651 | 0.032 |  | 0.651 | 0.068 |  | 0.651 | 0.032 | 0.715 | 0.132 |
| mfeat-fou | 2000 | 60 | 0.721 | 0.697 | 0 | 0.717 | 0.677 |  | 0.720 | 0.697 |  | 0.720 | 0.697 | 0.684 | 0.657 |
| UCI-seg | 2310 | 69 | 0.695 | 0.621 | 3.2 x 10^5 | 0.688 | 0.644 | [illegible] | 0.715 | 0.689 |  | 0.715 | 0.689 | 0.648 | 0.638 |
| wdbc | 569 | 17 | 0.924 | 0.668 |  | 0.907 | 0.609 | [illegible] | 0.678 | 0.002 |  | 0.678 | 0.002 | 0.909 | 0.556 |
| austra | 690 | 21 | 0.640 | 0.069 |  | 0.855 | 0.428 |  | 0.523 | 0.001 |  | 0.523 | 0.001 | 0.571 | 0.013 |
| german | 1000 | 30 | 0.591 | 0.012 |  | 0.566 | 0.001 |  | 0.602 | 0.001 |  | 0.602 | 0.001 | 0.578 | 0.005 |
| heart | 270 | 8 | 0.762 | 0.205 |  | 0.594 | 0.027 |  | 0.592 | 0.067 |  | 0.594 | 0.027 | 0.549 | 0.012 |
| sat | 6435 | 193 | 0.712 | 0.585 |  | 0.715 | 0.630 | -6.627 x 10^5 | 0.715 | 0.630 | -6.627 x 10^5 | 0.715 | 0.630 | 0.716 | 0.614 |
| vehicle | 846 | 25 | 0.523 | 0.257 |  | 0.448 | 0.200 |  | 0.383 | 0.070 |  | 0.383 | 0.070 | 0.419 | 0.113 |
| script | 12938 | 388 | 0.746 | 0.563 |  | 0.671 | 0.517 | -1.617 x 10^5 | 0.670 | 0.517 | -1.617 x 10^5 | 0.670 | 0.517 | 0.654 | 0.535 |
| texture | 4000 | 120 | 0.978 | 0.918 |  | 0.978 | 0.918 | [illegible] | 0.978 | 0.917 |  | 0.978 | 0.918 | 0.960 | 0.867 |
| ethn | 2630 | 79 | 0.956 | 0.739 |  | 0.645 | 0.000 | 1.498 x 10^5 | 0.650 | 0.008 | 1.499 x 10^5 | 0.650 | 0.008 | 0.646 | 0.068 |
| Mondrian | 10201 | 306 | 0.967 | 0.900 |  | 0.809 | 0.796 |  | 0.809 | 0.796 | [illegible] | 0.809 | 0.796 | 0.774 | 0.753 |
| diff-300 | 300 | 9 | 0.643 | 0.254 |  | 0.481 | 0.074 |  | 0.512 | 0.175 |  | 0.512 | 0.175 | 0.629 | 0.382 |
| sim-300 | 291 | 9 | 0.473 | 0.039 |  | 0.488 | 0.112 | [illegible] | 0.485 | 0.046 | [illegible] | 0.488 | 0.112 | 0.574 | 0.180 |
| same-300 | 297 | 9 | 0.527 | 0.080 |  | 0.477 | 0.027 |  | 0.489 | 0.039 |  | 0.489 | 0.039 | 0.618 | 0.222 |

Table 5.6: Performance of algorithms for clustering under constraints when the constraint level is 3%. The headings n, m, F, NMI, λ, and log-lik denote
the number of data points, the number of constraints, the F-score, the normalized mutual information, the optimal λ for the proposed algorithm
found by the validation procedure, and the log-likelihood, respectively.


| Data set | n | m | Proposed F | Proposed NMI | Proposed λ | Shental default F | Shental default NMI | Shental default log-lik | Shental special F | Shental special NMI | Shental special log-lik | Shental combined F | Shental combined NMI | Basu F | Basu NMI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| derm | 366 | 18 | 0.954 | 0.918 | [illegible] | 0.946 | 0.912 | [illegible] | 0.817 | 0.868 | [illegible] | 0.817 | 0.868 | 0.838 | 0.874 |
| digits | 5620 | 281 | 0.731 | 0.707 |  | 0.779 | 0.765 | -6.154 x 10^5 | 0.712 | 0.717 | -6.162 x 10^5 | 0.779 | 0.765 | 0.725 | 0.711 |
| ion | 351 | 18 | 0.823 | 0.367 | [illegible] | 0.650 | 0.034 | [illegible] | 0.650 | 0.034 | [illegible] | 0.650 | 0.034 | 0.710 | 0.126 |
| mfeat-fou | 2000 | 100 | 0.729 | 0.681 |  | 0.757 | 0.706 |  | 0.768 | 0.722 |  | 0.757 | 0.706 | 0.746 | 0.671 |
| UCI-seg | 2310 | 116 | 0.772 | 0.721 |  | 0.714 | 0.688 |  | 0.713 | 0.689 | [illegible] | 0.713 | 0.689 | 0.645 | 0.628 |
| wdbc | 569 | 28 | 0.879 | 0.509 | [illegible] | 0.894 | 0.574 | [illegible] | 0.892 | 0.570 | [illegible] | 0.894 | 0.574 | 0.875 | 0.444 |
| austra | 690 | 35 | 0.796 | 0.269 |  | 0.613 | 0.000 |  | 0.855 | 0.428 |  | 0.613 | 0.000 | 0.848 | 0.409 |
| german | 1000 | 50 | 0.573 | 0.000 |  | 0.566 | 0.001 |  | 0.665 | 0.000 |  | 0.566 | 0.001 | 0.586 | 0.000 |
| heart | 270 | 14 | 0.838 | 0.366 |  | 0.594 | 0.027 |  | 0.594 | 0.027 |  | 0.594 | 0.027 | 0.838 | 0.364 |
| sat | 6435 | 322 | 0.733 | 0.624 |  | 0.719 | 0.638 | -6.627 x 10^5 | 0.717 | 0.631 | -6.627 x 10^5 | 0.719 | 0.638 | 0.715 | 0.611 |
| vehicle | 846 | 42 | 0.443 | 0.132 |  | 0.490 | 0.251 |  | 0.475 | 0.214 |  | 0.475 | 0.214 | 0.420 | 0.113 |
| script | 12938 | 647 | 0.733 | 0.552 | [illegible] | 0.672 | 0.518 | -1.618 x 10^5 | 0.629 | 0.485 | -1.612 x 10^5 | 0.629 | 0.485 | 0.720 | 0.551 |
| texture | 4000 | 200 | 0.977 | 0.914 |  | 0.978 | 0.918 |  | 0.978 | 0.918 |  | 0.978 | 0.918 | 0.959 | 0.866 |
| ethn | 2630 | 132 | 0.958 | 0.749 |  | 0.851 | 0.408 | 1.496 x 10^5 | 0.646 | 0.007 | 1.497 x 10^5 | 0.646 | 0.007 | 0.738 | 0.234 |
| Mondrian | 10201 | 510 | 0.969 | 0.904 |  | 0.808 | 0.794 |  | 0.808 | 0.794 |  | 0.808 | 0.794 | 0.782 | 0.751 |
| diff-300 | 300 | 15 | 0.664 | 0.381 | [illegible] | 0.476 | 0.112 |  | 0.493 | 0.048 | [illegible] | 0.493 | 0.048 | 0.850 | 0.623 |
| sim-300 | 291 | 15 | 0.526 | 0.086 | [illegible] | 0.482 | 0.015 |  | 0.494 | 0.034 |  | 0.494 | 0.034 | 0.606 | 0.253 |
| same-300 | 297 | 15 | 0.413 | 0.033 |  | 0.477 | 0.079 |  | 0.486 | 0.047 |  | 0.477 | 0.079 | 0.573 | 0.166 |

Table 5.7: Performance of algorithms for clustering under constraints when the constraint level is 5%. The headings n, m, F, NMI, λ, and log-lik denote
the number of data points, the number of constraints, the F-score, the normalized mutual information, the optimal λ for the proposed algorithm
found by the validation procedure, and the log-likelihood, respectively.


| Data set | n | m | Proposed F | Proposed NMI | Proposed λ | Shental default F | Shental default NMI | Shental default log-lik | Shental special F | Shental special NMI | Shental special log-lik | Shental combined F | Shental combined NMI | Basu F | Basu NMI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| derm | 366 | 37 | 0.951 | 0.914 |  | 0.821 | 0.869 | [illegible] | 0.819 | 0.869 | [illegible] | 0.821 | 0.869 | 0.843 | 0.890 |
| digits | 5620 | 562 | 0.875 | 0.813 |  | 0.780 | 0.765 | -6.155 x 10^3 | 0.794 | 0.751 | -6.164 x 10^3 | 0.780 | 0.765 | 0.737 | 0.715 |
| ion | 351 | 35 | 0.821 | 0.354 |  | 0.647 | 0.027 |  | 0.647 | 0.027 | [illegible] | 0.647 | 0.027 | 0.696 | 0.111 |
| mfeat-fou | 2000 | 200 | 0.685 | 0.658 |  | 0.761 | 0.686 |  | 0.754 | 0.705 |  | 0.754 | 0.705 | 0.687 | 0.662 |
| UCI-seg | 2310 | 231 | 0.667 | 0.571 | 1.0 x 10^3 | 0.692 | 0.640 | [illegible] | 0.707 | 0.683 |  | 0.692 | 0.640 | 0.638 | 0.613 |
| wdbc | 569 | 57 | 0.939 | 0.688 |  | 0.795 | 0.376 |  | 0.913 | 0.624 |  | 0.795 | 0.376 | 0.851 | 0.383 |
| austra | 690 | 69 | 0.840 | 0.363 |  | 0.855 | 0.426 |  | 0.857 | 0.431 | [illegible] | 0.855 | 0.426 | 0.855 | 0.430 |
| german | 1000 | 100 | 0.656 | 0.026 |  | 0.646 | 0.008 |  | 0.646 | 0.008 |  | 0.646 | 0.008 | 0.582 | 0.006 |
| heart | 270 | 27 | 0.713 | 0.132 |  | 0.612 | 0.037 |  | 0.612 | 0.037 |  | 0.612 | 0.037 | 0.834 | 0.359 |
| sat | 6435 | 644 | 0.795 | 0.658 | 3.2 x 10^2 | 0.718 | 0.637 | -6.627 x 10^3 | 0.717 | 0.631 | -6.627 x 10^3 | 0.718 | 0.637 | 0.719 | 0.612 |
| vehicle | 846 | 85 | 0.473 | 0.164 | 1.0 x 10^3 | 0.453 | 0.212 | -7.924 x 10^3 | 0.378 | 0.069 | -7.258 x 10^3 | 0.378 | 0.069 | 0.433 | 0.117 |
| script | 12938 | 1294 | 0.696 | 0.518 | 1.0 x 10^3 | 0.673 | 0.516 | -1.620 x 10^3 | 0.672 | 0.515 | -1.620 x 10^3 | 0.672 | 0.515 | 0.687 | 0.506 |
| texture | 4000 | 400 | 0.973 | 0.903 | 3.2 x 10^-1 | 0.978 | 0.919 | -6.475 x 10^4 | 0.976 | 0.909 | -6.475 x 10^4 | 0.978 | 0.919 | 0.952 | 0.850 |
| ethn | 2630 | 263 | 0.963 | 0.772 | 3.2 x 10^2 | 0.891 | 0.541 | 1.497 x 10^5 | 0.933 | 0.653 | 1.495 x 10^3 | 0.891 | 0.541 | 0.870 | 0.488 |
| Mondrian | 10201 | 1020 | 0.970 | 0.906 | 3.2 x 10^2 | 0.808 | 0.789 | 4.070 x 10^4 | 0.808 | 0.789 | 4.070 x 10^4 | 0.808 | 0.789 | 0.773 | 0.757 |
| diff-300 | 300 | 30 | 0.740 | 0.409 | 3.2 x 10^2 | 0.607 | 0.276 | 3.089 x 10^3 | 0.508 | 0.157 | 3.065 x 10^3 | 0.607 | 0.276 | 0.898 | 0.684 |
| sim-300 | 291 | 29 | 0.615 | 0.201 | 1.0 x 10^4 | 0.569 | 0.164 | 3.133 x 10^3 | 0.493 | 0.012 | 3.145 x 10^3 | 0.493 | 0.012 | 0.594 | 0.265 |
| same-300 | 297 | 30 | 0.607 | 0.229 | 1.0 x 10^4 | 0.519 | 0.112 | 3.106 x 10^3 | 0.545 | 0.164 | 3.068 x 10^3 | 0.519 | 0.112 | 0.641 | 0.275 |

Table 5.8: Performance of algorithms for clustering under constraints when the constraint level is 10%. The headings n, m, F, NMI, λ, and log-lik denote
the number of data points, the number of constraints, the F-score, the normalized mutual information, the optimal λ for the proposed algorithm
found by the validation procedure, and the log-likelihood, respectively.


| Data set | n | m | Proposed F | Proposed NMI | Proposed λ | Shental default F | Shental default NMI | Shental default log-lik | Shental special F | Shental special NMI | Shental special log-lik | Shental combined F | Shental combined NMI | Basu F | Basu NMI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| derm | 366 | 55 | 0.954 | 0.918 |  | 0.806 | 0.861 |  | 0.819 | 0.862 | [illegible] | 0.819 | 0.862 | 0.682 | 0.718 |
| digits | 5620 | 843 | 0.874 | 0.790 |  | 0.760 | 0.746 | -6.156 x 10^3 | 0.894 | 0.824 | -6.164 x 10^3 | 0.760 | 0.746 | 0.803 | 0.749 |
| ion | 351 | 53 | 0.841 | 0.388 | [illegible] | 0.643 | 0.032 | [illegible] | 0.643 | 0.032 |  | 0.643 | 0.032 | 0.676 | 0.094 |
| mfeat-fou | 2000 | 300 | 0.708 | 0.679 | [illegible] | 0.763 | 0.705 |  | 0.709 | 0.679 |  | 0.709 | 0.679 | 0.673 | 0.653 |
| UCI-seg | 2310 | 347 | 0.739 | 0.692 | [illegible] | 0.706 | 0.661 | [illegible] | 0.717 | 0.689 |  | 0.717 | 0.689 | 0.467 | 0.395 |
| wdbc | 569 | 85 | 0.972 | 0.825 | [illegible] | 0.930 | 0.685 |  | 0.930 | 0.685 | [illegible] | 0.930 | 0.685 | 0.913 | 0.560 |
| austra | 690 | 104 | 0.851 | 0.395 |  | 0.861 | 0.440 | [illegible] | 0.861 | 0.440 |  | 0.861 | 0.440 | 0.848 | 0.407 |
| german | 1000 | 150 | 0.649 | 0.008 | 0 | 0.646 | 0.001 | [illegible] | 0.648 | 0.010 |  | 0.648 | 0.010 | 0.653 | 0.010 |
| heart | 270 | 41 | 0.819 | 0.327 | [illegible] | 0.760 | 0.206 |  | 0.609 | 0.034 |  | 0.609 | 0.034 | 0.830 | 0.342 |
| sat | 6435 | 965 | 0.796 | 0.658 | 1.0 x 10^4 | 0.719 | 0.638 | -6.628 x 10^3 | 0.717 | 0.634 | -6.628 x 10^3 | 0.719 | 0.638 | 0.703 | 0.610 |
| vehicle | 846 | 127 | 0.562 | 0.237 | 3.2 x 10^3 | 0.457 | 0.217 | -7.996 x 10^3 | 0.392 | 0.064 | -7.399 x 10^3 | 0.392 | 0.064 | 0.425 | 0.114 |
| script | 12938 | 1941 | 0.809 | 0.620 | 1.0 x 10^3 | 0.679 | 0.517 | -1.623 x 10^3 | 0.678 | 0.517 | -1.623 x 10^3 | 0.679 | 0.517 | 0.686 | 0.501 |
| texture | 4000 | 600 | 0.982 | 0.927 | 3.2 x 10^2 | 0.979 | 0.920 | -6.470 x 10^4 | 0.979 | 0.920 | -6.470 x 10^4 | 0.979 | 0.920 | 0.957 | 0.860 |
| ethn | 2630 | 395 | 0.957 | 0.746 | 1.0 x 10^2 | 0.897 | 0.558 | 1.497 x 10^3 | 0.957 | 0.743 | 1.495 x 10^5 | 0.897 | 0.558 | 0.865 | 0.487 |
| Mondrian | 10201 | 1530 | 0.970 | 0.905 | 3.2 x 10^2 | 0.967 | 0.900 | 4.033 x 10^4 | 0.967 | 0.900 | 4.033 x 10^4 | 0.967 | 0.900 | 0.786 | 0.753 |
| diff-300 | 300 | 45 | 0.807 | 0.518 | 1.0 x 10^5 | 0.477 | 0.065 | 3.082 x 10^3 | 0.480 | 0.081 | 3.164 x 10^3 | 0.480 | 0.081 | 0.937 | 0.770 |
| sim-300 | 291 | 44 | 0.479 | 0.090 | 1.0 x 10^4 | 0.484 | 0.075 | 3.142 x 10^3 | 0.490 | 0.015 | 3.180 x 10^3 | 0.490 | 0.015 | 0.602 | 0.271 |
| same-300 | 297 | 45 | 0.436 | 0.046 | 1.0 x 10^4 | 0.455 | 0.041 | 3.078 x 10^3 | 0.540 | 0.113 | 3.108 x 10^3 | 0.540 | 0.113 | 0.663 | 0.340 |

Table 5.9: Performance of algorithms for clustering under constraints when the constraint level is 15%. The headings n, m, F, NMI, λ, and log-lik denote
the number of data points, the number of constraints, the F-score, the normalized mutual information, the optimal λ for the proposed algorithm
found by the validation procedure, and the log-likelihood, respectively.


Figure 5.7: F-score and NMI for different algorithms for clustering under constraints for the data sets
ethn, Mondrian, and ion. The results of the proposed algorithm, Shental, and Basu are represented
by the red solid line, the blue dotted line, and the black dashed line, respectively. The performance of
a classifier trained using all the labels is shown by the gray dash-dot line. The horizontal axis shows
the number of constraints as a percentage of the number of data points.

Figure 5.8: F-score and NMI for different algorithms for clustering under constraints for the data
sets script, derm, and vehicle. The results of the proposed algorithm, Shental, and Basu are
represented by the red solid line, the blue dotted line, and the black dashed line, respectively. The
performance of a classifier trained using all the labels is shown by the gray dash-dot line. The
horizontal axis shows the number of constraints as a percentage of the number of data points.

Figure 5.9: F-score and NMI for different algorithms for clustering under constraints for the data
set wdbc. The results of the proposed algorithm, Shental, and Basu are represented by the red
solid line, the blue dotted line, and the black dashed line, respectively. The performance of a classifier
trained using all the labels is shown by the gray dash-dot line. The horizontal axis shows the number
of constraints as a percentage of the number of data points.

Figure 5.10: F-score and NMI for different algorithms for clustering under constraints for the data
sets UCI-seg, heart, and austra. The results of the proposed algorithm, Shental, and Basu are
represented by the red solid line, the blue dotted line, and the black dashed line, respectively. The
performance of a classifier trained using all the labels is shown by the gray dash-dot line. The
horizontal axis shows the number of constraints as a percentage of the number of data points.

Figure 5.11: F-score and NMI for different algorithms for clustering under constraints for the data
sets german, sim-300, and diff-300. The results of the proposed algorithm, Shental, and Basu
are represented by the red solid line, the blue dotted line, and the black dashed line, respectively. The
performance of a classifier trained using all the labels is shown by the gray dash-dot line. The
horizontal axis shows the number of constraints as a percentage of the number of data points.

Figure 5.12: F-score and NMI for different algorithms for clustering under constraints for the data
sets  sat  and  digits.  The  results  of  the  proposed  algorithm,  Shental,  and  Basu  are  represented  by 
the  red  solid  line,  blue  dotted  lines  and  the  black  dashed  line,  respectively.  The  performance  of  a 
classifier  trained  using  all  the  labels  is  shown  by  the  gray  dashdot  line.  The  horizontal  axis  shows 
the  number  of  constraints  as  the  percentage  of  the  number  of  data  points. 
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Figure  5.13:  F-score  and  NMI  for  different  algorithms  for  clustering  under  constraints  for  the  data 
sets  mfeat-f  ou,  same-300  and  texture.  The  results  of  the  proposed  algorithm,  Shental,  and  Basu 
are  represented  by  the  red  solid  line,  blue  dotted  lines  and  the  black  dashed  line,  respectively.  The 
performance  of  a  classifier  trained  using  all  the  labels  is  shown  by  the  gray  dashdot  line.  The 
horizontal  axis  shows  the  number  of  constraints  as  the  percentage  of  the  number  of  data  points. 
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jected 

Figure  5.14:  The  result  of  simultaneously  performing  feature  extraction  and  clustering  with  con- 
straints  simultaneously  on  the  data  set  in  Figure  5.3(a).  The  blue  line  in  (a)  corresponds  to  the 
projection  direction  found  by  the  algorithm.  The  projected  data  points  (which  is  ID),  together  with 
the  cluster  labels  and  the  two  Gaussians,  are  shown  in  (b). 


Figure 5.15: An example of learning the subspace and the clusters simultaneously. (a) The original
data and the constraints, where solid (dotted) lines correspond to must-link (must-not-link)
constraints. (b) Clustering result of projecting 20 features extracted by kernel PCA to a 2D space. (c)
Clustering solution. (d) Result of applying spectral clustering [194] to this data set with two clusters,
using the same kernel used for kernel PCA.

number of data points. Multiplication by the inverse of the Hessian in the line-search Newton
algorithm can be performed in $O(d^3)$ time, because the structure of the $O(d^2)$ by $O(d^2)$ Hessian
matrix is utilized for the inversion, and the matrix need not be formed explicitly. Unless the data
set is very small, the time cost of inverting the Hessian is insignificant when compared with the
calculation of the objective function J and its gradient. Therefore, one function evaluation of the
proposed algorithm is only marginally slower than one iteration of the EM algorithm for the mixture
model. Note that one Newton iteration can involve more than one function evaluation because of
the line-search.

Each  iteration  in  the  algorithm  Shental  is  similar  to  that  in  the  standard  EM  algorithm.  The 
difference  is  in  the  E-step,  in  which  Shental  involves  an  inference  for  a  Markov  network.  This  can 
take  exponential  time  with  respect  to  the  number  of  constraints  in  the  worst  case.  The  per-iteration 
computation  cost  in  Basu  is  in  general  smaller  than  both  Shental  and  the  proposed  algorithm, 
because it is fundamentally the k-means algorithm. However, the use of iterated conditional modes
to solve for the cluster labels in the hidden Markov random field, as well as the metric learning based
on the constraints, adds overhead due to the constraints.

In practice, the proposed algorithm is slower than the other two because of the cross-validation
procedure to determine the optimal $\lambda$. Even when $\lambda$ is fixed, however, the proposed algorithm is still
slower because (i) the optimization problem considered by the proposed algorithm is more difficult
than those considered by Shental and Basu, and (ii) the convergence criterion based on the relative
norm of the gradient is stricter.

5.7.2  Discriminative  versus  Generative 

One  way  to  view  the  difference  between  the  proposed  algorithm  and  the  algorithms  Shental  and 
Basu  is  that  both  Shental  and  Basu  are  generative,  whereas  the  proposed  approach  is  a  combination 
of  generative  and  discriminative.  In  supervised  learning,  a  classifier  is  “generative”  if  it  assumes  a 
certain  model  on  how  the  data  from  different  classes  are  generated  via  the  specification  of  the  class 
conditional  densities,  whereas  a  “discriminative”  classifier  is  built  by  optimizing  some  error  measure, 
without  any  regard  to  the  class  conditional  densities.  Discriminative  approaches  are  often  superior 
to  generative  approaches  when  the  actual  class  conditional  densities  differ  from  their  assumed  forms. 
On  the  other  hand,  incorporation  of  prior  knowledge  is  easier  for  generative  approaches  because  one 
can  construct  a  generative  model  based  on  the  domain  knowledge.  Discriminative  approaches  are 
also  more  prone  to  overfitting. 

In the context of clustering under constraints, Shental and Basu can be regarded as generative
because they specify a hidden Markov random field to describe how the data are generated. The
constraint violation term T($\theta$; C) used by the proposed algorithm is discriminative, because it
effectively counts the number of violated constraints, which is analogous to the number of misclassified
samples. The log-likelihood term L($\theta$; Y) in the proposed objective function is generative because it
is based on how the data are generated by a finite mixture model. Therefore, the proposed approach
is both generative and discriminative, with the tradeoff parameter $\lambda$ controlling the relative
importance of these two properties. One can think of the discriminative component as enabling the proposed
algorithm to achieve a higher performance, whereas the generative component acts as a regularization
term to prevent overfitting in the discriminative component.
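
The analogy between violated constraints and misclassified samples can be sketched as follows. This is a simplified illustration, not the objective used in this chapter: it counts violations under a hard cluster assignment, whereas the proposed algorithm uses a smooth term. The function names, the penalty form, and the sign convention (an objective to be maximized, with violations subtracted) are assumptions made for the example.

```python
from typing import Iterable, Sequence, Tuple

Pair = Tuple[int, int]


def count_violations(labels: Sequence[int],
                     must_link: Iterable[Pair],
                     must_not_link: Iterable[Pair]) -> int:
    """Count the constraints that a hard cluster assignment fails to satisfy."""
    violated = 0
    for i, j in must_link:
        # A must-link constraint is violated when the two points are separated.
        if labels[i] != labels[j]:
            violated += 1
    for i, j in must_not_link:
        # A must-not-link constraint is violated when the two points are merged.
        if labels[i] == labels[j]:
            violated += 1
    return violated


def combined_objective(log_likelihood: float,
                       labels: Sequence[int],
                       must_link: Iterable[Pair],
                       must_not_link: Iterable[Pair],
                       lam: float) -> float:
    """Hypothetical tradeoff: generative fit minus lam times the violation count."""
    return log_likelihood - lam * count_violations(labels, must_link, must_not_link)
```

With `lam = 0` the score reduces to the generative term alone; larger values of `lam` give the discriminative term more weight.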

This discussion provides a new perspective on the example in Figure 5.3. Shental and
Basu, being generative, failed to recover the two desired clusters because the forms of these clusters differ
significantly from what Shental and Basu assume about a cluster. On the other hand, the discriminative
property of the proposed algorithm can locate the desired vertical cluster boundary, which can satisfy
the constraints.

The discriminative nature of the proposed algorithm is also the reason why the proposed
algorithm, using constraints only, can outperform the generative classifier using all the labels. This is
surprising at first, because, after all, constraints carry less information than labels. Incorporating the
constraints on only some of the objects therefore should not outperform the case when the labels of
all objects are available. However, this is only true when all possible classifiers are considered. When
we restrict ourselves to the generative classifier that assumes a Gaussian distribution with a common
covariance matrix as the class conditional density, it is possible for a discriminative algorithm to
outperform the generative classifier if the class conditional densities are non-Gaussian. In fact, for
the data sets ethn, Mondrian, script, wdbc, and texture, we observed that the proposed algorithm
can have a higher F-score or NMI than that estimated using all the class labels. The difference is
more noticeable for script and wdbc. Note that for the data set austra, the generative algorithm
Shental can also outperform the classifier trained using all the labels, though the difference is very
small and it may be due to the noisy nature of this data set.

5.7.3  Drawback  of  the  Proposed  Approach 

There are two main drawbacks of the proposed approach. The first is that the optimization problem
considered, while accurately representing the goal of clustering with constraints, is more difficult. This has
several consequences. First, a more sophisticated algorithm (line-search Newton) is needed instead
of the simpler EM algorithm. Second, the landscape of the proposed objective function is more “rugged”,
so the algorithm is more likely to get trapped in poor local optima. It also takes more iterations to reach a
local optimum. Because we initialize randomly, this also means that the proposed algorithm
is not very stable if the number of random initializations is insufficient.

The second difficulty is the determination of $\lambda$. (Note that the algorithm Basu has a similar
parameter.) In our experiments, we adopted a cross-validation procedure to determine $\lambda$, which is
computationally expensive. Cross-validation may yield a suboptimal $\lambda$ when the number of
informative constraints in the validation set is too small, or when too many constraints are erroneous
due to the noise in the data. Here, a constraint is informative if it provides “useful” information to
the clustering process. For example, a must-link constraint between two points close to each other is not very
informative because they are likely to be in the same cluster anyway.

Another problem is that we may encounter an unfavorable split of the training and validation
constraints when the set of available constraints is too small. When this happens, the number of
violations for the validation constraints is significantly larger than that of the training constraints.
Increasing the value of $\lambda$ cannot reduce the violation of the validation constraints, leading to an
optimal constraint strength of zero. In this case, we should try a different split of the
constraints for training and validation.
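
The validation procedure for the constraint strength can be sketched as below. This is a minimal illustration under stated assumptions: `fit` is a hypothetical stand-in for any routine that clusters the data with a given constraint strength and the training constraints and returns hard labels; the candidate grid and the tie-breaking rule (keep the first candidate with the fewest violations) are also assumptions, not details taken from the experiments.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# (i, j, True) is a must-link constraint; (i, j, False) is a must-not-link constraint.
Constraint = Tuple[int, int, bool]


def violations(labels: Sequence[int], constraints: Sequence[Constraint]) -> int:
    """Number of constraints not satisfied by a hard assignment."""
    return sum((labels[i] == labels[j]) != must for i, j, must in constraints)


def select_lambda(candidates: Sequence[float],
                  fit: Callable[[float, Sequence[Constraint]], List[int]],
                  train: Sequence[Constraint],
                  valid: Sequence[Constraint]) -> Tuple[Optional[float], Optional[int]]:
    """Pick the candidate whose clustering violates the fewest validation constraints."""
    best_lam, best_viol = None, None
    for lam in candidates:
        labels = fit(lam, train)          # cluster using the training constraints only
        v = violations(labels, valid)     # score on the held-out constraints
        if best_viol is None or v < best_viol:
            best_lam, best_viol = lam, v
    return best_lam, best_viol
```

If the selected value is zero, the split of the constraints may be unfavorable in the sense described above, and the procedure can be repeated with a different split.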

5.7.4  Some  Implementation  Details 

We have incorporated some heuristics in our optimization algorithm. During the optimization
process, a cluster may become almost empty. This is detected when J2i^ij/n falls below a threshold,
which is set to 4 x 10 At. The empty cluster is removed, and the largest cluster whose split
increases the value of J is split to maintain the same number of clusters. If no such cluster exists,
the cluster whose split leads to the smallest decrease in J is split. Another heuristic is that we lower-bound
$\alpha_j$ by 10 ® , no matter what the values of $\{\beta_j\}$ are. This is used to improve the numerical stability
of the proposed algorithm. The $\alpha_j$ are then renormalized to ensure that they sum to one.
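
The lower-bounding and renormalization of the mixing weights can be sketched as follows. The function name and the `floor` argument are hypothetical; the floor value used in the experiments is not reproduced here, so any value passed in is a placeholder.

```python
from typing import List, Sequence


def floor_and_renormalize(weights: Sequence[float], floor: float) -> List[float]:
    """Clamp each mixing weight from below, then rescale so the weights sum to one."""
    clipped = [max(w, floor) for w in weights]
    total = sum(clipped)
    return [w / total for w in clipped]
```

Because the rescaling divides by a total that is at least one when the input already sums to one, a clamped weight can end up marginally below the floor; it remains strictly positive, which is what matters for numerical stability.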




5.8  Summary 


We have presented an algorithm that handles instance-level constraints for model-based clustering.
The key assumption in our approach is that the cluster labels are determined based on the feature
vectors and the cluster parameters; the set of constraints has no influence here. This contrasts
with previous approaches like [231] and [21], which impose a prior distribution on the cluster labels
directly to reflect the constraints. Imposing such a prior is the fundamental reason for the anomaly described in
Section 5.2. The actual clustering is performed by the line-search Newton algorithm under the
natural parameterization of the Gaussian distributions. The strength of the constraints is determined
by a hold-out set of validation constraints. The proposed approach can be extended to handle
feature extraction and clustering under constraints simultaneously. The effectiveness of the proposed
approach has been demonstrated on both synthetic data sets and real-world data sets from different
domains. In particular, we notice that the discriminative nature of the proposed algorithm can lead
to superior performance when compared with a generative classifier trained using the labels of all
the objects.

Chapter 6


Summary 


The  primary  objective  of  the  work  presented  in  this  dissertation  is  to  advance  the  state-of-the-art  in 
unsupervised  learning.  Unsupervised  learning  is  challenging  because  its  objective  is  often  ill-defined. 
Instead  of  providing  yet  another  new  unsupervised  learning  algorithm,  we  are  more  interested  in 
studying  issues  that  are  generic  to  different  unsupervised  learning  tasks.  This  is  the  motivation 
behind  the  study  of  various  topics  in  this  dissertation,  including  the  modification  of  the  batch 
version  of  an  algorithm  to  become  incremental,  the  selection  of  the  appropriate  data  representation 
(feature selection), and the incorporation of side-information in an unsupervised learning task.


6.1  Contributions 

The results in this thesis have contributed to the field of unsupervised learning in several ways, and
have led to the publication of two journal articles [163, 164]. Several conference papers [168, 161, 167,
165, 82] have also been published at different stages of the research conducted in this thesis.

The  incremental  ISOMAP  algorithm  described  in  Chapter  3  has  made  the  following  contributions: 

•  Framework for incremental manifold learning: The proposed incremental ISOMAP algorithm
can serve as a general framework for converting a manifold learning algorithm to become
incremental: the neighborhood graph is first updated, followed by the update of the
low-dimensional representation, which is often an incremental eigenvalue problem similar to our
case.

•  Solution  of  the  all-pairs  shortest  path  problems:  One  component  in  the  incremental  algorithm 
is  to  update  the  all-pairs  shortest  path  distances  in  view  of  the  change  in  the  neighborhood 
graph  due  to  the  new  data  points.  We  have  developed  a  new  algorithm  that  performs  such 
an  update  efficiently.  Our  algorithm  updates  the  shortest  path  distances  from  multiple  source 
vertices  simultaneously.  This  contrasts  with  previous  work  like  [193],  where  different  shortest 
path  trees  are  updated  independently. 

•  Improved embedding for new data points: We have derived an improved estimate of the inner
product between the low-dimensional representation of the new point and the low-dimensional
representations of the existing points. This leads to an improved embedding for the new point.

•  Algorithm for incremental eigen-decomposition with increasing matrix size: The problem of
updating the low-dimensional representation of the data points is essentially an incremental
eigen-decomposition problem. Unlike the previous work [270], however, the size of the matrix
we considered is increasing.

•  Vertex contraction to memorize the effect of data points: A vertex contraction procedure that
improves the geodesic distance estimate without additional memory is proposed.

Our work on estimating the feature saliency and the number of clusters simultaneously in
Chapter 4 has made the following contributions:

•  Feature  Saliency  in  unsupervised  learning:  The  problem  of  feature  selection/feature  saliency 
estimation  is  rarely  studied  for  unsupervised  learning.  We  tackle  this  problem  by  introducing  a 
notion  of  feature  saliency,  which  is  able  to  describe  the  difference  between  the  distributions  of 
a  feature  among  different  clusters.  The  saliency  is  estimated  efficiently  by  the  EM  algorithm. 

•  Automatic  Feature  Saliency  and  Determination  of  the  Number  of  Clusters:  The  algorithm  in 
[81],  which  utilizes  the  minimum  message  length  to  select  the  number  of  clusters  automatically, 
is  extended  to  estimate  the  feature  saliency. 

The clustering under constraints algorithm proposed in Chapter 5 has made the following
contributions:

•  New  objective  function  for  clustering  under  constraints:  We  have  proposed  a  new  objective 
function  for  clustering  under  constraints  under  the  assumption  that  the  constraints  do  not 
have  any  direct  influence  on  the  cluster  labels.  Extensive  experimental  evaluations  reveal  that 
this  objective  function  is  superior  to  the  other  state-of-the-art  algorithms  in  most  cases.  It  is 
also  easy  to  extend  the  proposed  objective  function  to  handle  group  constraints  that  involve 
more  than  two  data  points. 

•  Avoidance  of  Counter-intuitive  Clustering  Result: 

The proposed objective function can avoid the pitfall of previous algorithms for clustering under
constraints, such as [231] and [21], which are based on hidden Markov random fields. Specifically,
previous algorithms can produce clustering solutions that assign a data point a cluster label different
from those of all its neighbors, a situation avoided by the proposed algorithm.

•  Robustness  to  model-mismatch: 

The  proposed  objective  function  for  clustering  under  constraints  is  a  combination  of  generative 
and  discriminative  terms.  The  discriminative  term,  which  is  based  on  the  satisfaction  of 
the  constraints,  improves  the  robustness  of  the  proposed  algorithm  towards  mismatch  in  the 
cluster  shape.  This  leads  to  an  improvement  in  the  overall  performance.  The  improvement  can 
sometimes  be  so  significant  that  the  proposed  algorithm,  using  constraints  only,  outperforms 
a  generative  supervised  classifier  trained  using  all  the  labels. 

•  Feature extraction and clustering with constraints: The proposed algorithm has been extended
to perform feature extraction and clustering with constraints simultaneously by locating the
best low-dimensional subspace, such that the Gaussian clusters formed will satisfy the given
set of constraints as well as they can. This allows the proposed algorithm to handle data
sets with higher dimensionality. The combination of this notion of feature extraction and the
kernel trick allows us to extract clusters with general shapes.

•  Efficient implementation of the Line-search Newton Algorithm:
The proposed objective function is optimized by the line-search Newton algorithm. The
multiplication by the inverse of the Hessian for the case of a Gaussian mixture can be done efficiently
with time complexity $O(d^3)$ without forming the $O(d^2)$ by $O(d^2)$ Hessian matrix explicitly.
Here, d denotes the number of features. A naive approach of inverting the Hessian would
require $O(d^6)$ time.
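
To illustrate how structure can reduce the cost of multiplying by the inverse of a $d^2$ by $d^2$ matrix, consider the following sketch. It is not the derivation used in this thesis: it assumes, purely for illustration, that the matrix has the Kronecker form $A \otimes A$ for some invertible $d$ by $d$ matrix $A$, and it relies only on the standard identity relating Kronecker products and the vec operator.

```latex
% Illustrative assumption: H = A \otimes A, with A an invertible d x d matrix.
% vec(.) stacks the columns of a d x d matrix into a vector of length d^2.
\begin{align*}
H\,\mathrm{vec}(X) &= (A \otimes A)\,\mathrm{vec}(X) = \mathrm{vec}\!\left(A X A^{\top}\right), \\
H^{-1}\,\mathrm{vec}(V) &= (A^{-1} \otimes A^{-1})\,\mathrm{vec}(V)
                        = \mathrm{vec}\!\left(A^{-1} V A^{-\top}\right).
\end{align*}
% Cost: two d x d linear solves, i.e. O(d^3), compared with
% O((d^2)^3) = O(d^6) for inverting a dense d^2 x d^2 matrix.
```

Under this assumption the product is obtained from operations on $d$ by $d$ matrices only, so the large matrix never has to be stored.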

6.2  Future  work 

The  study  conducted  in  this  dissertation  leads  to  several  interesting  new  research  possibilities. 

•  Improvement  in  the  efficiency  of  the  incremental  ISOMAP  algorithm 

There are several possibilities for improving the efficiency of the proposed incremental ISOMAP
algorithm. Data structures such as kd-tree, ball-tree, and cover-tree [19] can be used to speed
up the search of the k nearest neighbors. The update strategy for geodesic distance and
co-ordinates can be more aggressive; we can sacrifice the theoretical convergence property in
favor of empirical efficiency. For example, the geodesic distance can be updated approximately
using a scheme analogous to the distance vector protocol in the network routing literature.
Co-ordinate update can be made faster if only a subset of the co-ordinates (such as those close
to the new point) are updated at each iteration. The co-ordinates of every point would
eventually be updated if the new points came from different regions of the manifold.

•  Incrementalization  of  other  manifold  learning  algorithms 

The  algorithm  in  Chapter  3  modifies  the  ISOMAP  algorithm  to  become  incremental.  We  can 
also  modify  similar  algorithms,  such  as  locally  linear  embedding  or  Laplacian  eigenmap  to 
become  incremental. 

•  Features  dependency  in  dimensionality  reduction  and  unsupervised  learning:  The  algorithm 
in  Chapter  4  assumes  that  the  features  are  conditionally  independent  of  each  other  when  the 
cluster  labels  are  known.  This  assumption,  however,  is  generally  not  true  in  practice.  A  new 
algorithm  needs  to  be  designed  to  cope  with  the  situation  when  features  are  highly  correlated 
in  this  setting. 

•  Feature  selection  and  constraints: 

The  main  difficulty  of  feature  selection  in  clustering  is  the  ill-posed  nature  of  the  problem.  A 
possible way to make the problem more well-defined is to introduce instance-level constraints.
In  Section  5.5,  we  described  an  algorithm  for  performing  feature  extraction  and  clustering 
under  constraints  simultaneously.  One  can  apply  a  similar  idea  and  use  the  constraints  to 
assist  in  feature  selection  for  clustering. 

•  More  efficient  algorithms  for  clustering  with  constraints 

The use of the line-search Newton algorithm for optimizing the objective function in Chapter 5 is
relatively efficient when compared with alternative approaches. Unfortunately, the objective
function, which effectively uses Jensen-Shannon divergence to count the number of violated
constraints, is difficult to optimize. It is similar to the minimization of the number of
classification errors directly in supervised learning, which is generally perceived as difficult. Often,
the number of errors is approximated by some quantities that are easier to optimize, such as
the distances of misclassified points from the separating hyperplane in the case of support
vector machines. In the current context, we may want to approximate the number of violated
constraints by some quantities that are easier to optimize. A difficulty can arise, however,
when both must-link and must-not-link constraints are considered. If the violation of a
must-link constraint is approximated by a convex function $g(\cdot)$, the violation of a must-not-link
constraint is naturally approximated by $-g(\cdot)$, which is concave. Their combination leads to a
function that is neither concave nor convex, which is difficult to optimize. Techniques like DC
(difference of convex functions) programming [117] can be adopted for global optimization.

•  Number  of  clusters  for  clustering  with  constraints 

The algorithm described in Chapter 5 assumes that the number of clusters is known. It would be
desirable to estimate the number of clusters automatically from the data. The presence
of constraints should be helpful in this process. In fact, correlation clustering [10] considers
must-link and must-not-link constraints only, without any regard to the feature vectors, and
it can infer the optimal number of clusters by minimizing the number of constraint violations.

Appendix A


Details  of  Incremental  ISOMAP 


In this appendix, we prove the correctness of the algorithms in Chapter 3 and
analyze their time complexity.

A.1  Update of Neighborhood Graph

The procedure to update the neighborhood graph has been described in Section 3.2.1.1, where $\mathcal{A}$,
the set of edges to be added, and $\mathcal{D}$, the set of edges to be deleted, are constructed upon insertion
of $v_{n+1}$ to the neighborhood graph.

Time Complexity. For each $i$, the conditions in Equations (3.1)
and (3.2) can be checked in constant time. So, the construction of $\mathcal{A}$ and $\mathcal{D}$ takes $O(n)$ time. The
calculation of $\iota_i$ for all $i$ can be done in $O(\sum_i \deg(v_i) + |\mathcal{A}|)$ or $O(|E| + |\mathcal{A}|)$ time by examining the
neighbors of different vertices. Here, $\deg(v_i)$ denotes the degree of $v_i$. The complexity of the update
of the neighborhood graph can be bounded by $O(nq)$, where $q$ is the maximum degree of the vertices in
the graph after inserting $v_{n+1}$. Note that $\iota_i$ becomes the $\tau_i$ for the updated neighborhood graph.

A.2  Update of Geodesic Distances: Edge Deletion

A.2.1  Finding Vertex Pairs For Update

In this section, we examine how the geodesic distances should be updated upon edge deletion.
Consider an edge $e(a,b) \in \mathcal{D}$ that is to be deleted. If $\pi_{ab} \neq a$, the shortest path between $v_a$ and $v_b$
does not contain $e(a,b)$. Deletion of $e(a,b)$ does not affect $sp(a,b)$ and hence none of the existing
shortest paths are affected. Therefore, we have

Lemma A.1. If $\pi_{ab} \neq a$, deletion of $e(a,b)$ does not affect any of the existing shortest paths and
therefore no geodesic distance $g_{ij}$ needs to be updated.

We now consider the case $\pi_{ab} = a$. This implies $\pi_{ba} = b$ because the graph is undirected. The
next lemma is an easy consequence of this assumption.

Lemma A.2. For any vertex $v_i$, $sp(i,b)$ passes through $v_a$ iff $sp(i,b)$ contains $e(a,b)$ iff $\pi_{ib} = a$.

Before we proceed further, recall the definitions of $T(b)$ and $T(b;a)$ in Section 3.1: $T(b)$ is the
shortest path tree rooted at $v_b$, in which $sp(b,j)$ consists of the tree edges from $v_b$ to
$v_j$, and $T(b;a)$ is the subtree of $T(b)$ rooted at $v_a$.

Let $R_{ab} = \{i : \pi_{ib} = a\}$. Intuitively, $R_{ab}$ contains vertices whose shortest paths to $v_b$ include
$e(a,b)$. We shall first construct $R_{ab}$, and then “propagate” from $R_{ab}$ to get the geodesic distances
that require update.

Because $sp(t,b)$ passes through the vertices that are the ancestors of $v_t$ in $T(b)$, plus $v_t$ itself, we have

Lemma A.3. $R_{ab} = \{$vertices in $T(b;a)\}$.

Proof.

$v_t \in T(b;a)$

$\Leftrightarrow$ $v_a$ is an ancestor of $v_t$ in $T(b)$, or $v_a = v_t$

$\Leftrightarrow$ $sp(t,b)$ passes through $v_a$

$\Leftrightarrow$ $\pi_{tb} = a$ (Lemma A.2)

$\Leftrightarrow$ $t \in R_{ab}$

□

If $v_t$ is a child of $v_u$ in $T(b)$, $v_u$ is the vertex in $sp(b,t)$ just before $v_t$. Thus, we have the lemma
below.

Lemma A.4. The set of children of $v_u$ in $T(b)$ = $\{v_t : v_t$ is a neighbor of $v_u$ and $\pi_{bt} = u\}$.

Consequently, we can examine all the neighbors of $v_u$ to find the node’s children in $T(b)$ based
on the predecessor matrix. Note that the shortest path trees are not stored explicitly; only the
predecessor matrix is maintained. The first nine lines in Algorithm 3.1 perform a tree traversal that
extracts all the vertices in $T(b;a)$ to form $R_{ab}$, using Lemma A.4 to find all the children of a node
in the tree.


Time Complexity. At any time, the queue Q contains vertices in the subtree $T(b;a)$ that have
been examined. The while-loop is executed $|R_{ab}|$ times because a new vertex is added to $R_{ab}$ in
each iteration. The inner for-loop is executed a total of $\sum_{t \in R_{ab}} \deg(v_t)$ times, which can be bounded
loosely by $q|R_{ab}|$. Therefore, a loose bound for the first nine lines in Algorithm 3.1 is $O(q|R_{ab}|)$.
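
The traversal described above can be sketched in Python as follows. This is an illustrative reconstruction from Lemmas A.1 to A.4, not a transcription of Algorithm 3.1. The helper names are hypothetical, the predecessor matrix is built here with Floyd-Warshall only to keep the example self-contained, and shortest paths are assumed to be unique so that the predecessor matrix is consistent in both directions.

```python
from collections import deque
import math


def all_pairs_shortest_paths(n, edges):
    """Floyd-Warshall on an undirected weighted graph.

    Returns (dist, pred), where pred[i][j] is the vertex just before v_j
    on the shortest path from v_i to v_j (None when i == j or unreachable).
    """
    dist = [[math.inf] * n for _ in range(n)]
    pred = [[None] * n for _ in range(n)]
    for i in range(n):
        dist[i][i] = 0.0
    for (i, j), w in edges.items():
        dist[i][j] = dist[j][i] = w
        pred[i][j], pred[j][i] = i, j
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
                    pred[i][j] = pred[k][j]
    return dist, pred


def adjacency(n, edges):
    """Neighbor sets of an undirected graph given as {(i, j): weight}."""
    adj = [set() for _ in range(n)]
    for i, j in edges:
        adj[i].add(j)
        adj[j].add(i)
    return adj


def build_r_ab(a, b, adj, pred):
    """Collect the vertices of T(b; a), i.e. all i with pred[i][b] == a."""
    if pred[a][b] != a:
        # Lemma A.1: e(a, b) lies on no shortest path, so nothing is affected.
        return set()
    r_ab = set()
    queue = deque([a])            # start at the root of the subtree T(b; a)
    while queue:
        t = queue.popleft()
        r_ab.add(t)
        for u in adj[t]:
            # Lemma A.4: v_u is a child of v_t in T(b) iff pred[b][u] == t.
            if pred[b][u] == t:
                queue.append(u)
    return r_ab
```

Each vertex of the subtree is enqueued once because it has a unique parent in $T(b)$, and each dequeued vertex scans its neighbors once, which matches the loose bound stated above.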


A.2.2  Propagation Step

Define $F_{(a,b)} = \{(i,j) : sp(i,j)$ contains $e(a,b)\}$. Here, $(a,b)$ denotes the unordered pair $a$ and
$b$. So, $F_{(a,b)}$ is indexed by the unordered pair $(a,b)$, and its elements are also unordered pairs.
Intuitively, $F_{(a,b)}$ contains the vertex pairs whose geodesic distances need to be recomputed when
the edge $e(a,b)$ is deleted. Starting from $v_b$ for each of the vertices in $R_{ab}$, we construct $F_{(a,b)}$ by a
search.

Lemma A.5. If $(i,j) \in F_{(a,b)}$, either $i$ or $j$ is in $R_{ab}$.

Proof. $(i,j) \in F_{(a,b)}$ is equivalent to $sp(i,j)$ contains $e(a,b)$. The shortest path $sp(i,j)$ can be
written either as $sp(i,j) = v_i \leadsto v_a \to v_b \leadsto v_j$, or $sp(i,j) = v_i \leadsto v_b \to v_a \leadsto v_j$, where $\leadsto$ denotes
a path between the two vertices. Because the subpath of a shortest path is also a shortest path,
either $sp(i,b)$ or $sp(j,b)$ passes through $v_a$. By Lemma A.2, either $\pi_{ib} = a$ or $\pi_{jb} = a$. Hence either
$i$ or $j$ is in $R_{ab}$. □

Lemma A.6. $F_{(a,b)} = \bigcup_{u \in R_{ab}} \{(u,t) : v_t$ is in $T(u;b)\}$.

Figure A.1: Example of $T(u;b)$ and $T(a;b)$. All the nodes and the edges shown constitute $T(a;b)$,
whereas only the part of the subtree above the line constitutes $T(u;b)$. This example illustrates the
relationship of $T(u;b)$ and $T(a;b)$ as proved in Lemma A.7.


Proof. By Lemma A.5, $(u,t) \in F_{(a,b)}$ implies either $u$ or $t$ is in $R_{ab}$. Without loss of generality, suppose $u \in R_{ab}$. So, $sp(u,t)$ can be written as $v_u \rightsquigarrow v_a \to v_b \rightsquigarrow v_t$. Thus $v_t$ must be in $T(u;b)$.

On the other hand, for any vertex $v_t$ in the subtree $T(u;b)$, $sp(u,t)$ goes through $v_b$. Since $sp(u,b)$ goes through $v_a$ (because $u \in R_{ab}$), $sp(u,t)$ must also go through $v_a$ and hence use $e(a,b)$. □

Direct application of the above lemma to compute $F_{(a,b)}$ requires the construction of $T(u;b)$ for different $u$. This is not necessary, however, because for all $u \in R_{ab}$, $T(u;b)$ must be a part of $T(a;b)$ in the sense that is exemplified in Figure A.1. This relationship aids the construction of $T(u;b)$ in Algorithm 3.1 (the variable $T'$) because we only need to expand the vertices in $T(a;b)$ that are also in $T(u;b)$.

Lemma A.7. Consider $u \in R_{ab}$. The subtree $T(u;b)$ is non-empty, and let $v_t$ be any vertex in this subtree. Let $v_s$ be a child of $v_t$ in $T(u;b)$, if any. We have the following:

1. $v_t$ is in the subtree of $T(a;b)$.

2. $v_s$ is a child of $v_t$ in the subtree of $T(a;b)$.

3. $\pi_{us} = \pi_{as} = t$.

Proof. The subtree $T(u;b)$ is not empty because $v_b$ is in this subtree. For any $v_t$ in this subtree, $sp(u,t)$ passes through $v_b$. Hence $sp(u,b)$ is a subpath of $sp(u,t)$. Because $u \in R_{ab}$, $sp(u,b)$ passes through $v_a$. So, we can write $sp(u,t)$ as $v_u \rightsquigarrow v_a \to v_b \rightsquigarrow v_t$. So, $sp(a,t)$ contains $v_b$, and this implies that $v_t$ is in $T(a;b)$.

Now, if $v_s$ is a child of $v_t$ in $T(u;b)$, $sp(u,s)$ can be written as $v_u \rightsquigarrow v_a \to v_b \rightsquigarrow v_t \to v_s$. So, $\pi_{us} = t$. Because any subpath of a shortest path is also a shortest path, $sp(a,s)$ is simply $v_a \to v_b \rightsquigarrow v_t \to v_s$, which implies that $v_s$ is also a child of $v_t$ in $T(a;b)$, and $\pi_{as} = t$. Therefore, we have $\pi_{us} = \pi_{as} = t$. □

Let $F$ be the set of unordered pairs $(i,j)$ such that a new shortest path from $v_i$ to $v_j$ is needed when the edges in $\mathcal{D}$ are removed. So, $F = \bigcup_{e(a,b) \in \mathcal{D}} F_{(a,b)}$. For each $(a,b) \in \mathcal{D}$, $R_{ab}$ constructed in the first nine lines in Algorithm 3.1 is used to construct $F_{(a,b)}$ from line 11 until the end of Algorithm 3.1. At each iteration of the while-loop starting at line 15, the subtree $T(a;b)$ is traversed, using the condition $\pi_{us} = \pi_{as}$ to check if $v_s$ is in $T(u;b)$ or not. The part of the subtree $T(a;b)$ is expanded only when necessary, using the variable $T'$.
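
For illustration, Lemma A.6 can be applied directly, as in the Python sketch below. It traverses $T(u;b)$ separately for every $u \in R_{ab}$ and therefore does not include the reuse of the partially expanded subtree $T'$ that Algorithm 3.1 relies on. The names `construct_f_ab`, `adj` and `pred` are hypothetical, with `pred[u][s]` assumed to hold $\pi_{us}$.

```python
def construct_f_ab(b, r_ab, adj, pred):
    """Return F_(a,b) as a set of unordered pairs, following Lemma A.6.

    r_ab       -- vertices whose shortest path to v_b passes through v_a
    adj[t]     -- iterable of the neighbours of vertex t
    pred[u][s] -- pi_{us}, the vertex just before s on sp(u, s)
    """
    f_ab = set()
    for u in r_ab:
        stack = [b]                      # v_b is the root of the subtree T(u; b)
        while stack:
            t = stack.pop()
            f_ab.add(frozenset((u, t)))  # sp(u, t) uses the edge e(a, b)
            for s in adj[t]:
                # Lemma A.4: v_s is a child of v_t in T(u) iff pi_{us} = t.
                if pred[u][s] == t:
                    stack.append(s)
    return f_ab
```

Storing each pair as a `frozenset` keeps the pairs unordered, so a pair reached from both of its endpoints is recorded only once.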


Time Complexity  If we ignore the time to construct $T'$, the complexity of the construction of $F$ is proportional to the number of vertices examined. If the maximum degree of $T'$ is $q'$, this is bounded by $O(q'|F|)$. Note that $q' \le q$, where $q$ is the maximum degree of the vertices in the neighborhood graph. The time to expand $T'$ is proportional to the number of vertices actually expanded plus the number of edges incident on those vertices. This is bounded by $q$ times the size of the tree, and the size of the tree is at most $O(|F_{(a,b)}|)$. Usually, the time is much less, because different $u$ in $R_{ab}$ can reuse the same $T'$. The time complexity to construct $F_{(a,b)}$ can be bounded by $O(q|F_{(a,b)}|)$ in the worst case. The overall time complexity to construct $F$, which is the union of $F_{(a,b)}$ for $(a,b) \in \mathcal{D}$, is $O(q|F|)$, assuming the number of duplicate pairs in $F_{(a,b)}$ for different $(a,b)$ is $O(1)$. Empirically, there are at most several such pairs. Most of the time, there is no duplicate pair at all.


A.2.3 Performing The Update

Let $\mathcal{G}' = (V, E \setminus \mathcal{D})$ be the graph after deleting the edges in $\mathcal{D}$. Let $\mathcal{B}$ be an auxiliary undirected graph with the same vertices as $\mathcal{G}$, but its edges are based on $F$. In other words, there is an edge between $v_i$ and $v_j$ in the graph $\mathcal{B}$ if and only if $(i,j)$ is in $F$. Because $F$ contains all the vertex pairs whose geodesic distances need to be updated, an edge in $\mathcal{B}$ corresponds to a geodesic distance value that needs to be revised.

To update the geodesic distances, we first pick a vertex $v_u$ in $\mathcal{B}$ with at least one edge incident on it. Define $C(u) = \{i : e(u,i) \text{ is an edge of } \mathcal{B}\}$. So, the geodesic distance $g_{ui}$ needs to be updated if and only if $i \in C(u)$. These geodesic distances are updated by the modified Dijkstra’s algorithm (Algorithm 3.2), with $v_u$ as the source vertex and $C(u)$ as the set of “unprocessed vertices”, i.e., the set of vertices such that their shortest paths from $v_u$ are invalid. Recall the basic idea of Dijkstra’s algorithm is that, starting with an empty set of “processed vertices” (vertices whose shortest paths have been found), different vertices are added one by one to this set in an ascending order of estimated shortest path distances. The ascending order guarantees the optimality of the shortest paths. Algorithm 3.2 does something similar, except that the set of “processed vertices” begins with $V \setminus C(u)$ instead of an empty set. The first for-loop estimates the shortest path distances for $j \in C(u)$ if $sp(u,j)$ is “one edge away” from the processed vertices, i.e., $sp(u,j)$ can be written as $v_u \rightsquigarrow v_a \to v_j$ with $a \in V \setminus C(u)$. In the while-loop, the vertex $v_k$ ($k \in C(u)$) with the smallest estimated shortest path distance is examined and transferred into the set of processed vertices. The estimates of the shortest path distances between $v_u$ and the adjacent vertices of $v_k$ are relaxed (updated) accordingly. This repeats until $C(u)$ becomes empty, i.e., all vertices have been processed.
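
The update for a single source vertex can be sketched in Python as below. This is an illustration under stated assumptions, not a transcription of Algorithm 3.2: the function and argument names are hypothetical, and a binary heap from `heapq` with lazy deletion of stale entries stands in for the DecreaseKey operation.

```python
import heapq
import math


def modified_dijkstra(u, c_u, adj, w, g):
    """Recompute g[u][j] for every j in c_u, updating g in place.

    u    -- source vertex
    c_u  -- C(u): vertices whose geodesic distance from u is invalid
    adj  -- adjacency lists of the graph after the edge deletion
    w    -- w[i][j] is the weight of the edge e(i, j)
    g    -- symmetric matrix of geodesic distances; the entries g[u][a]
            with a outside C(u) are trusted
    """
    todo = set(c_u)
    est = {}
    # First loop: estimate vertices that are one edge away from a processed vertex.
    for j in todo:
        est[j] = min((g[u][a] + w[a][j] for a in adj[j] if a not in todo),
                     default=math.inf)
    heap = [(d, j) for j, d in est.items()]
    heapq.heapify(heap)
    while heap:
        d, k = heapq.heappop(heap)
        if k not in todo or d > est[k]:
            continue                      # stale heap entry, skip it
        todo.remove(k)                    # v_k joins the processed vertices
        g[u][k] = g[k][u] = d
        for j in adj[k]:                  # relax the neighbours of v_k
            if j in todo and d + w[k][j] < est[j]:
                est[j] = d + w[k][j]
                heapq.heappush(heap, (est[j], j))
    return g
```

A vertex that is no longer reachable from the source keeps an estimate of infinity, which is then stored as its geodesic distance.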

When the modified Dijkstra’s algorithm with $v_u$ as the source vertex finishes, all geodesic distances involving $v_u$ have been updated. Since an edge in $\mathcal{B}$ corresponds to a geodesic distance estimate requiring update, we should remove all edges incident on $v_u$ in $\mathcal{B}$. We then select another vertex $v_{u'}$ with at least one edge incident on it in $\mathcal{B}$, and call the modified Dijkstra’s algorithm again but with $v_{u'}$ as the source vertex. This repeats until $\mathcal{B}$ becomes an empty graph.


Time Complexity  The for-loop in Algorithm 3.2 takes at most $O(q|C(u)|)$ time. In the while-loop, there are $|C(u)|$ ExtractMin operations, and the number of DecreaseKey operations depends on how many edges there are within the vertices in $C(u)$. An upper bound for this is $q|C(u)|$. By using Fibonacci’s heap, ExtractMin can be done in $O(\log |C(u)|)$ time while DecreaseKey can be done in $O(1)$ time, on average. Thus the complexity of Algorithm 3.2 is $O(|C(u)| \log |C(u)| + q|C(u)|)$. If a binary heap is used instead, the complexity is $O(q|C(u)| \log |C(u)|)$.

A.2.4 Order for Performing Update

How do we select $v_u$ in $\mathcal{B}$ to be eliminated and to act as the source vertex for the modified Dijkstra’s algorithm (Algorithm 3.2)? We seek an elimination order that minimizes the time complexity of all the updates. Let $f_i$ be the degree of $v_{k_i}$, the $i$-th vertex removed from $\mathcal{B}$. So, $f_i = |C(k_i)|$. The overall time complexity for running the modified Dijkstra’s algorithm (with Fibonacci’s heap) for all the vertices in $\mathcal{B}$ with at least an incident edge is $O(T)$, with

$$T = \sum_i \left( f_i \log f_i + q f_i \right). \tag{A.1}$$

Because $\sum_i f_i$ is a constant (twice the number of edges in $\mathcal{B}$) with respect to different elimination orders, the vertices should be eliminated in an order that minimizes $\sum_i f_i \log f_i$. If a binary heap is used, the time complexity is $O(T^*)$, with

$$T^* = q \sum_i f_i \log f_i. \tag{A.2}$$

In both cases, we should minimize $\sum_i f_i \log f_i$. Finding an order that minimizes this is difficult, unfortunately. Since this sum is dominated by the largest $f_i$, we instead minimize $\max_i f_i$. This minimization is achieved by a greedy algorithm that removes the vertex in $\mathcal{B}$ with the smallest degree. The correctness of this greedy approach can be seen from the following argument. Suppose the greedy algorithm is wrong. So, at some point the algorithm makes a mistake, i.e., the removal of $v_t$ instead of $v_u$ leads to an increase of $\max_i f_i$. This can only happen when $\deg(v_t) > \deg(v_u)$. We get a contradiction, since the algorithm always removes the vertex with the smallest degree.

Because the degree of each vertex is an integer, an array of linked lists can be used to implement the greedy search (Algorithm 3.3) efficiently without an explicit search. At any time, the linked list $l[i]$ is empty for $i < pos$. So, the vertex in $l[pos]$ has the smallest degree in $\mathcal{B}$. The for-loop in lines 10 to 18 removes all the edges incident on $v_j$ in $\mathcal{B}$ by reducing the degree of all vertices adjacent to $v_j$ by one, and moving $pos$ back by one if necessary.
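
The greedy elimination can be sketched in Python as follows. This is an illustrative sketch rather than Algorithm 3.3 itself: Python sets replace the linked lists, and the names `elimination_order`, `b_adj` and `buckets` are hypothetical. The call to the modified Dijkstra’s algorithm for each removed vertex is omitted; only the order and the degree at removal time are returned.

```python
def elimination_order(b_adj):
    """Greedy smallest-degree-first elimination of the auxiliary graph B.

    b_adj -- dict mapping each vertex to the set of its neighbours in B
    Returns a list of (vertex, degree at removal time) pairs.
    """
    nbrs = {v: set(ns) for v, ns in b_adj.items()}    # private copy of B
    deg = {v: len(ns) for v, ns in nbrs.items()}
    max_deg = max(deg.values(), default=0)
    buckets = [set() for _ in range(max_deg + 1)]     # buckets[d]: vertices of degree d
    for v, d in deg.items():
        buckets[d].add(v)
    order = []
    pos = 1                                           # buckets[1 .. pos-1] are empty
    while pos <= max_deg:
        if not buckets[pos]:
            pos += 1
            continue
        v = buckets[pos].pop()                        # a vertex of smallest degree
        order.append((v, pos))
        for t in nbrs[v]:                             # delete the edges incident on v
            nbrs[t].discard(v)
            buckets[deg[t]].discard(t)
            deg[t] -= 1
            buckets[deg[t]].add(t)
            if 0 < deg[t] < pos:
                pos = deg[t]                          # move pos back if necessary
        nbrs[v] = set()
        deg[v] = 0
    return order
```

On a star graph, for example, the leaves are removed first, so the largest degree seen at removal time is one, whereas removing the center first would give a degree equal to the number of leaves.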

Time Complexity  The first for-loop in Algorithm 3.3 takes $O(|F|)$ time, because $|F|$ is the number of edges in $\mathcal{B}$. In the second for-loop, $pos$ is incremented at most $2n$ times, because it can move backwards at most $n$ steps. The inner for-loop is executed altogether $O(|F|)$ times. Therefore, the overall time complexity for Algorithm 3.3 (excluding the time for executing the modified Dijkstra’s algorithm) is $O(|F|)$.


A.3 Update of Geodesic Distances: Edge Insertion

In Equation (3.3), we describe how the geodesic distance between the new vertex $v_{n+1}$ and $v_i$ is computed, after updating the geodesic distance in view of the edge deletion. Since all the edges in $\mathcal{A}$, the set of edges inserted into the neighborhood graph, are incident on $v_{n+1}$, any improvement in an existing shortest path must involve $v_{n+1}$. Let $L = \{(i,j) : w_{i,n+1} + w_{n+1,j} < g_{ij}\}$. Intuitively, $L$ is the set of unordered pairs adjacent to $v_{n+1}$ with improved shortest paths due to the insertion of $v_{n+1}$.
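
The set $L$ can be built by checking every pair of neighbors of the new vertex, as in this minimal Python sketch. The names `improved_pairs`, `new_w` and `g` are hypothetical; `new_w[i]` is assumed to hold the weight $w_{i,n+1}$ of an inserted edge.

```python
def improved_pairs(new_w, g):
    """Return the pairs in L together with their improved path lengths.

    new_w -- dict: neighbour index i -> weight w_{i,n+1} of the inserted edge
    g     -- current geodesic distance matrix among the existing vertices
    """
    nbrs = sorted(new_w)
    improved = []
    for x, i in enumerate(nbrs):
        for j in nbrs[x + 1:]:                    # each unordered pair once
            through_new = new_w[i] + new_w[j]     # length of v_i -> v_{n+1} -> v_j
            if through_new < g[i][j]:
                improved.append((i, j, through_new))
    return improved
```

The double loop visits each unordered pair of inserted edges once, which matches a cost quadratic in the number of inserted edges.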

For different $(a,b) \in L$, Algorithm 3.4 is used to propagate the effect of the improvement in $sp(a,b)$ to the vertices near $v_a$ and $v_b$. First, lines 1 to 9 construct a set $S_{ab}$ that is similar to $R_{ab}$ in Algorithm 3.1, and it consists of vertices whose shortest paths to $v_b$ have been improved. For each vertex $v_i$ in $S_{ab}$, lines 11 to 22 search for other shortest paths starting from $v_i$ that can be improved, and update the geodesic distance according to the improved shortest path just discovered. Its idea is analogous to the construction of $F_{(a,b)}$ in Algorithm 3.1, but now $sp(a,b)$ is improved instead of destroyed as in the case of $F_{(a,b)}$.

The correctness of Algorithm 3.4 can be seen by the following argument. Without loss of generality, the improved shortest path between $v_i$ and $v_j$ can be written as $v_i \rightsquigarrow v_a \to v_{n+1} \to v_b \rightsquigarrow v_j$. So, $v_i$ is a vertex in $T(n+1;a)$, and $v_j$ must be in both $T(i;b)$ and $T(n+1;b)$. If $v_l$ is a child of $v_j$ in $T(i;b)$, $v_l$ is also a child of $v_j$ in $T(n+1;b)$, and $(g_{i,n+1} + g_{n+1,l}) < g_{il}$ should be satisfied. In other words, the relationship between $T(i;b)$ and $T(n+1;b)$ here is similar to the relationship between $T(u;b)$ and $T(a;b)$ depicted in Figure A.1. The proof of these properties is similar to the proof given for the relationship between $F_{(a,b)}$ and $R_{ab}$, and hence is not repeated.

Time Complexity  The set $L$ can be constructed in $O(|\mathcal{A}|^2)$ time. Let $H = \{(i,j) : \text{a better shortest path appears between } v_i \text{ and } v_j \text{ because of } v_{n+1}\}$. By an argument similar to the complexity of constructing $F$, the complexity of finding $H$ and revising the corresponding geodesic distances in Algorithm 3.4 is $O(q|H| + |\mathcal{A}|^2)$.

A.4 Geodesic Distance Update: Overall Time Complexity

Updating the neighborhood graph takes $O(nq)$ time. The construction of $R_{ab}$ and $F_{(a,b)}$ (Algorithm 3.1) takes $O(q|R_{ab}|)$ and $O(q|F_{(a,b)}|)$ time, respectively. Since $|F_{(a,b)}| \ge |R_{ab}|$, these steps take $O(q|F_{(a,b)}|)$ time together. As a result, $F$ can be constructed in $O(q|F|)$ time. The time to run the modified Dijkstra’s algorithm (Algorithm 3.2) is difficult to estimate. Let $\mu$ be the number of vertices in $\mathcal{B}$ with at least one edge incident on it, and let $\nu = \max_i f_i$ with $f_i$ defined in Appendix A.2.4. In the highly unlikely worst case, $\nu$ can be as large as $\mu$. The time of running Algorithm 3.2 can be rewritten as $O(\mu \nu \log \nu + q|F|)$. The typical value of $\nu$ can be estimated using concepts from random graph theory. It is easy to see that

$$\nu = \max\{l : \mathcal{B} \text{ has an } l\text{-regular sub-graph}\}, \tag{A.3}$$

where an $l$-regular sub-graph is defined as a subgraph with the degree of all vertices as $l$. Unfortunately, we fail to locate the exact result on the behavior of the largest $l$-regular sub-graph in random graph theory. On the other hand, the largest $l$-complete sub-graph, i.e., a clique of size $l$, of a random graph has been well studied. The clique number (the size of the largest clique in a graph) of almost every graph is “close” to $O(\log \mu)$ [200], assuming the average degree of vertices is a constant and $\mu$ is the number of vertices in the graph. Based on our empirical observations in the experiments, we conjecture that, on average, $\nu$ is also of the order $O(\log \mu)$. With this conjecture, the total time to run the Dijkstra’s algorithm can be bounded by $O(\mu \log \mu \log\log \mu + q|F|)$. Finally, the time complexity of Algorithm 3.4 is $O(q|H| + |\mathcal{A}|^2)$. So, the overall time complexity can be written as $O(q|F| + q|H| + \mu \log \mu \log\log \mu + |\mathcal{A}|^2)$. Note that $\mu \le 2|F|$. In practice, the first two terms dominate, and the complexity can be written as $O(q(|F| + |H|))$.


Appendix B


Calculations  for  Clustering  with 
Constraints 


The  purpose  of  this  appendix  is  to  derive  the  results  in  Chapter  5,  some  of  which  are  relatively 
involved. 


B.1 First Order Information

In this appendix, we shall derive the gradient of the objective function $\mathcal{J}$. The differential of a variable or a function $x$ will be denoted by “$d\,x$”. We shall first compute the differential of $\mathcal{J}$, followed by the conversion of the differentials into the derivatives with respect to the cluster parameters.

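B.1.1 Computing the Differential

The differential of the log-likelihood can be derived as follows:

$$
\begin{aligned}
d\,\mathcal{L}(\theta;\mathcal{Y}) &= \sum_{i=1}^{n} d \left( \log \sum_{j=1}^{k} \exp(\log q_{ij}) \right) \\
&= \sum_{i=1}^{n} \sum_{j=1}^{k} \frac{\exp(\log q_{ij})}{\sum_{j'} \exp(\log q_{ij'})} \, (d \log q_{ij}) \\
&= \sum_{i=1}^{n} \sum_{j=1}^{k} r_{ij} \, (d \log q_{ij}).
\end{aligned}
\tag{B.1}
$$

Here, $r_{ij} = \exp(\log q_{ij}) / \sum_{j'} \exp(\log q_{ij'}) = q_{ij} / \sum_{j'} q_{ij'}$ is the usual posterior probability for the $j$-th cluster given the point $y_i$. The annealing version of the log-likelihood, which is needed if we want to apply a deterministic-annealing type of procedure to optimize the log-likelihood, is defined by

$$
\mathcal{L}_{\text{annealed}}(\theta;\mathcal{Y},\gamma) = \sum_{i=1}^{n} \sum_{j=1}^{k} \tilde{r}_{ij} \log q_{ij} - \frac{1}{\gamma} \sum_{i=1}^{n} \sum_{j=1}^{k} \tilde{r}_{ij} \log \tilde{r}_{ij},
\tag{B.2}
$$

where $\gamma$ is the inverse temperature parameter. Note that $\mathcal{L}_{\text{annealed}}(\theta;\mathcal{Y},\gamma)$ becomes the log-likelihood $\mathcal{L}(\theta;\mathcal{Y})$ when $\gamma$ is one. The inverse temperature $\gamma$ is different from the smoothness parameter $\tau$: $\gamma$ is related to all the data points, whereas $\tau$ is only concerned with objects involved in the constraints. The “fuzzy” cluster assignment $\tilde{r}_{ij}$ is defined as

$$
\tilde{r}_{ij} = \frac{q_{ij}^{\gamma}}{\sum_{j'} q_{ij'}^{\gamma}}.
\tag{B.3}
$$

A small value of $\gamma$ corresponds to a state of high temperature, in which the cluster assignments are highly uncertain. The first term in Equation (B.2) can also be understood as a weighted sum of distortion in coding theory, with $\tilde{r}_{ij}$ as the weights and $\log q_{ij}$ as the distortion. The second term in Equation (B.2) is proportional to the sum of the entropy of $\tilde{r}_{ij}$. Because

$$
\mathcal{L}_{\text{annealed}}(\theta;\mathcal{Y},\gamma) = \sum_{ij} \tilde{r}_{ij} \log q_{ij} - \frac{1}{\gamma} \sum_{ij} \tilde{r}_{ij} \log \tilde{r}_{ij} = \frac{1}{\gamma} \sum_{i} \log \sum_{l} \exp(\gamma \log q_{il}),
$$

the differential of the annealed log-likelihood is similar to that of the log-likelihood, which is

$$
d\,\mathcal{L}_{\text{annealed}}(\theta;\mathcal{Y},\gamma) = \sum_{ij} \tilde{r}_{ij} \, (d \log q_{ij}).
\tag{B.4}
$$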

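Our next step is to derive the differential for the constraint satisfaction function $\mathcal{F}(\theta;\mathcal{C})$. Based on the definition of $s_{ij}$ in Equation (5.12), we can obtain its differential as

$$
\begin{aligned}
d \log s_{ij} &= d\,(\tau \log q_{ij}) - d \log \sum_{l} \exp(\tau \log q_{il}) \\
&= \tau (d \log q_{ij}) - \tau \sum_{l} s_{il} \, (d \log q_{il}), \\
d\,s_{ij} &= \tau s_{ij} \left( d \log q_{ij} - \sum_{l} s_{il} \, (d \log q_{il}) \right), \\
d\,t^{+}_{hj} &= \sum_{i} a_{hi} \, (d\,s_{ij}), \\
d\,t^{-}_{hj} &= \sum_{i} b_{hi} \, (d\,s_{ij}).
\end{aligned}
$$

Note that $\sum_{j=1}^{k} d\,s_{ij} = \sum_{j=1}^{k} d\,t^{+}_{hj} = \sum_{j=1}^{k} d\,t^{-}_{hj} = 0$ because $\sum_{j} s_{ij} = \sum_{j=1}^{k} t^{+}_{hj} = \sum_{j=1}^{k} t^{-}_{hj} = 1$. The differential for the negative entropy of $s_{ij}$, $t^{+}_{hj}$ and $t^{-}_{hj}$ can be derived by considering

$$
\begin{aligned}
d \sum_{j} s_{ij} \log s_{ij} &= \sum_{j} (\log s_{ij} + 1)(d\,s_{ij}) = \sum_{j} \log s_{ij} \, (d\,s_{ij}) \\
&= \tau \sum_{j} s_{ij} \log s_{ij} \left( d \log q_{ij} - \sum_{l} s_{il} \, (d \log q_{il}) \right) \\
&= \tau \sum_{j} \left( s_{ij} \log s_{ij} - s_{ij} \sum_{l} s_{il} \log s_{il} \right) (d \log q_{ij}), \\
d \sum_{j} t^{+}_{hj} \log t^{+}_{hj} &= \sum_{j} \log t^{+}_{hj} \, (d\,t^{+}_{hj}) \\
&= \tau \sum_{j} \log t^{+}_{hj} \sum_{i} a_{hi} s_{ij} \left( d \log q_{ij} - \sum_{l} s_{il} \, (d \log q_{il}) \right) \\
&= \tau \sum_{ij} a_{hi} s_{ij} \left( \log t^{+}_{hj} - \sum_{l} s_{il} \log t^{+}_{hl} \right) (d \log q_{ij}), \\
d \sum_{j} t^{-}_{hj} \log t^{-}_{hj} &= \tau \sum_{ij} b_{hi} s_{ij} \left( \log t^{-}_{hj} - \sum_{l} s_{il} \log t^{-}_{hl} \right) (d \log q_{ij}).
\end{aligned}
$$

The differential for the Jensen-Shannon divergence term is then given by

$$
\begin{aligned}
d\,D^{+}_{JS}(h) &= d \left( \sum_{i} a_{hi} \sum_{j} s_{ij} \log s_{ij} - \sum_{j} t^{+}_{hj} \log t^{+}_{hj} \right) \\
&= \tau \sum_{ij} a_{hi} s_{ij} \left( \log s_{ij} - \sum_{l} s_{il} \log s_{il} \right) (d \log q_{ij}) \\
&\quad - \tau \sum_{ij} a_{hi} s_{ij} \left( \log t^{+}_{hj} - \sum_{l} s_{il} \log t^{+}_{hl} \right) (d \log q_{ij}) \\
&= \tau \sum_{ij} a_{hi} s_{ij} \left( \log \frac{s_{ij}}{t^{+}_{hj}} - \sum_{l} s_{il} \log \frac{s_{il}}{t^{+}_{hl}} \right) (d \log q_{ij}), \\
d\,D^{-}_{JS}(h) &= \tau \sum_{ij} b_{hi} s_{ij} \left( \log \frac{s_{ij}}{t^{-}_{hj}} - \sum_{l} s_{il} \log \frac{s_{il}}{t^{-}_{hl}} \right) (d \log q_{ij}).
\end{aligned}
$$

The differential of the loss functions of constraint violation can thus be written as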


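$$
\begin{aligned}
d\,\mathcal{F}(\theta;\mathcal{C}) &= d \left( - \sum_{h=1}^{m^{+}} \lambda^{+}_{h} D^{+}_{JS}(h) + \sum_{h=1}^{m^{-}} \lambda^{-}_{h} D^{-}_{JS}(h) \right) \\
&= -\tau \sum_{ij} \Bigg( \sum_{h=1}^{m^{+}} \lambda^{+}_{h} a_{hi} s_{ij} \left( \log \frac{s_{ij}}{t^{+}_{hj}} - \sum_{l} s_{il} \log \frac{s_{il}}{t^{+}_{hl}} \right) \\
&\qquad\qquad - \sum_{h=1}^{m^{-}} \lambda^{-}_{h} b_{hi} s_{ij} \left( \log \frac{s_{ij}}{t^{-}_{hj}} - \sum_{l} s_{il} \log \frac{s_{il}}{t^{-}_{hl}} \right) \Bigg) (d \log q_{ij}) \\
&= -\tau \sum_{ij} \Bigg( \sum_{h=1}^{m^{+}} \lambda^{+}_{h} a_{hi} s_{ij} \log \frac{s_{ij}}{t^{+}_{hj}} - \sum_{h=1}^{m^{-}} \lambda^{-}_{h} b_{hi} s_{ij} \log \frac{s_{ij}}{t^{-}_{hj}} \\
&\qquad\qquad - s_{ij} \sum_{l} \sum_{h=1}^{m^{+}} \lambda^{+}_{h} a_{hi} s_{il} \log \frac{s_{il}}{t^{+}_{hl}} + s_{ij} \sum_{l} \sum_{h=1}^{m^{-}} \lambda^{-}_{h} b_{hi} s_{il} \log \frac{s_{il}}{t^{-}_{hl}} \Bigg) (d \log q_{ij}) \\
&= -\tau \sum_{ij} \left( w_{ij} - s_{ij} \sum_{l} w_{il} \right) (d \log q_{ij}),
\end{aligned}
\tag{B.5}
$$

where we define

$$
\begin{aligned}
w_{ij} &= \left( \sum_{h=1}^{m^{+}} \lambda^{+}_{h} a_{hi} \right) s_{ij} \log s_{ij} - s_{ij} \sum_{h=1}^{m^{+}} \lambda^{+}_{h} a_{hi} \log t^{+}_{hj} \\
&\quad - \left( \sum_{h=1}^{m^{-}} \lambda^{-}_{h} b_{hi} \right) s_{ij} \log s_{ij} + s_{ij} \sum_{h=1}^{m^{-}} \lambda^{-}_{h} b_{hi} \log t^{-}_{hj} \\
&= \left( \sum_{h=1}^{m^{+}} \lambda^{+}_{h} a_{hi} - \sum_{h=1}^{m^{-}} \lambda^{-}_{h} b_{hi} \right) s_{ij} \log s_{ij} \\
&\quad - s_{ij} \left( \sum_{h=1}^{m^{+}} \lambda^{+}_{h} a_{hi} \log t^{+}_{hj} - \sum_{h=1}^{m^{-}} \lambda^{-}_{h} b_{hi} \log t^{-}_{hj} \right).
\end{aligned}
\tag{B.6}
$$

It is interesting to note that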


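$$
\begin{aligned}
\sum_{i=1}^{n} \sum_{j=1}^{k} w_{ij} &= \sum_{i=1}^{n} \sum_{j=1}^{k} \Bigg( \sum_{h=1}^{m^{+}} \lambda^{+}_{h} a_{hi} s_{ij} \log s_{ij} - \sum_{h=1}^{m^{-}} \lambda^{-}_{h} b_{hi} s_{ij} \log s_{ij} \\
&\qquad\qquad - \sum_{h=1}^{m^{+}} \lambda^{+}_{h} a_{hi} s_{ij} \log t^{+}_{hj} + \sum_{h=1}^{m^{-}} \lambda^{-}_{h} b_{hi} s_{ij} \log t^{-}_{hj} \Bigg) \\
&= \sum_{h=1}^{m^{+}} \lambda^{+}_{h} \sum_{i=1}^{n} \sum_{j=1}^{k} a_{hi} s_{ij} \log \frac{s_{ij}}{t^{+}_{hj}} - \sum_{h=1}^{m^{-}} \lambda^{-}_{h} \sum_{i=1}^{n} \sum_{j=1}^{k} b_{hi} s_{ij} \log \frac{s_{ij}}{t^{-}_{hj}} \\
&= \sum_{h=1}^{m^{+}} \lambda^{+}_{h} D^{+}_{JS}(h) - \sum_{h=1}^{m^{-}} \lambda^{-}_{h} D^{-}_{JS}(h) = -\mathcal{F}(\theta;\mathcal{C}).
\end{aligned}
\tag{B.7}
$$

Therefore, summing all $w_{ij}$ provides a way to compute the loss function for constraint violation.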
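
The identity in Equation (B.7) can be checked numerically with the Python sketch below. It rests on an assumption that is inferred from the differentials above rather than quoted from Chapter 5, namely $t^{+}_{hj} = \sum_i a_{hi} s_{ij}$ and $t^{-}_{hj} = \sum_i b_{hi} s_{ij}$, with each row of the weight matrices summing to one. The function and argument names are hypothetical.

```python
import math


def constraint_weights(s, a, b, lam_pos, lam_neg):
    """Compute w_ij of Equation (B.6) and the two groups of divergences.

    s                -- n x k matrix of s_ij (positive, every row sums to one)
    a, b             -- m+ x n and m- x n weight matrices (every row sums to one)
    lam_pos, lam_neg -- the multipliers lambda_h^+ and lambda_h^-
    """
    n, k = len(s), len(s[0])

    def mix(c):
        # t_hj = sum_i c_hi * s_ij (assumed definition of t^+ and t^-)
        return [[sum(c_h[i] * s[i][j] for i in range(n)) for j in range(k)]
                for c_h in c]

    def term(c_h, t_h, i, j):
        # c_hi * s_ij * log(s_ij / t_hj)
        return c_h[i] * s[i][j] * math.log(s[i][j] / t_h[j])

    t_pos, t_neg = mix(a), mix(b)
    w = [[sum(lam * term(a_h, t_h, i, j)
              for lam, a_h, t_h in zip(lam_pos, a, t_pos))
          - sum(lam * term(b_h, t_h, i, j)
                for lam, b_h, t_h in zip(lam_neg, b, t_neg))
          for j in range(k)] for i in range(n)]
    d_pos = [sum(term(a_h, t_h, i, j) for i in range(n) for j in range(k))
             for a_h, t_h in zip(a, t_pos)]
    d_neg = [sum(term(b_h, t_h, i, j) for i in range(n) for j in range(k))
             for b_h, t_h in zip(b, t_neg)]
    return w, d_pos, d_neg
```

Summing the returned `w` over all entries and comparing it with the weighted difference of `d_pos` and `d_neg` reproduces the two sides of the identity.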


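We are now ready to write down the differential of $\mathcal{J}$:

$$
d\,\mathcal{J} = \sum_{i=1}^{n} \sum_{j=1}^{k} \left( \tilde{r}_{ij} - \tau \left( w_{ij} - s_{ij} \sum_{l=1}^{k} w_{il} \right) \right) (d \log q_{ij}).
\tag{B.8}
$$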

B.1.2 Gradient Computation

Since the only differentials in Equation (B.8) are $(d \log q_{ij})$, the gradient of $\mathcal{J}$ can be obtained by converting these differentials into derivatives. Recall that $q_{ij} = \alpha_j \, p(y_i|\theta_j)$. So,

$$
\frac{\partial}{\partial \log \alpha_l} \log q_{ij} = I(j = l),
$$

where $I(\cdot)$ is the indicator function, and is one if the argument is true and zero otherwise. To enforce the restriction that $\alpha_j > 0$ and $\sum_j \alpha_j = 1$, we introduce new variables $\beta_j$ and express $\alpha_j$ in terms of $\{\beta_j\}$:

$$
\alpha_j = \frac{\exp(\beta_j)}{\sum_{j'=1}^{k} \exp(\beta_{j'})}.
\tag{B.9}
$$

We then have


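$$
\begin{aligned}
\frac{\partial}{\partial \beta_l} \log \alpha_j &= \frac{\partial}{\partial \beta_l} \left( \beta_j - \log \sum_{j'} \exp(\beta_{j'}) \right) = I(j = l) - \frac{\exp(\beta_l)}{\sum_{j'} \exp(\beta_{j'})} \\
&= I(j = l) - \alpha_l, \\
\frac{\partial}{\partial \beta_l} \log q_{ij} &= \sum_{m=1}^{k} \frac{\partial \log q_{ij}}{\partial \log \alpha_m} \frac{\partial \log \alpha_m}{\partial \beta_l} = \sum_{m=1}^{k} I(j = m) \left( I(m = l) - \alpha_l \right) \\
&= I(j = l) - \alpha_l.
\end{aligned}
$$

If $p(y_i|\theta_j)$ falls into the exponential family (Section 5.1.1), and $\theta_j$ is the natural parameter, the derivative of $\log q_{ij}$ with respect to $\theta_l$ can be written as

$$
\frac{\partial}{\partial \theta_l} \log q_{ij} = I(j = l) \left( \phi(y_i) - \frac{\partial}{\partial \theta_l} A(\theta_l) \right).
\tag{B.10}
$$

Note that $\phi(y_i) - \frac{\partial}{\partial \theta_l} A(\theta_l)$ is zero when the sufficient statistics of the observed data (represented by $\phi(y_i)$) equal their expected value (represented by $\frac{\partial}{\partial \theta_l} A(\theta_l)$). In this case, the convexity of $A(\theta_l)$ guarantees that the log-likelihood is maximized.

Before going into the special case of the Gaussian distribution, we want to note that for any number $c_{ij}$, we have

$$
\begin{aligned}
\sum_{ij} c_{ij} \frac{\partial}{\partial \beta_l} \log q_{ij} &= \sum_{ij} c_{ij} \left( I(l = j) - \alpha_l \right) = \sum_{i} c_{il} - \alpha_l \sum_{ij} c_{ij}, \\
\sum_{ij} c_{ij} \frac{\partial}{\partial \theta_l} \log q_{ij} &= \sum_{i} c_{il} \frac{\partial}{\partial \theta_l} \log q_{il} = \sum_{i} c_{il} \left( \phi(y_i) - \frac{\partial}{\partial \theta_l} A(\theta_l) \right) \\
&= \sum_{i} c_{il} \, \phi(y_i) - \frac{\partial A(\theta_l)}{\partial \theta_l} \sum_{i} c_{il}.
\end{aligned}
$$

The gradient of $\mathcal{J}$ can be computed by substituting $c_{ij} = \tilde{r}_{ij} - \tau \left( w_{ij} - s_{ij} \sum_{l=1}^{k} w_{il} \right)$.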
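
For the mixing-weight parameters, the summation identity reduces the gradient to column sums of the coefficients, as this small Python sketch shows. The names `softmax`, `grad_beta` and `grad_beta_direct` are hypothetical, and `c` stands for any matrix of coefficients $c_{ij}$. The second function evaluates the same quantity term by term as a cross-check.

```python
import math


def softmax(beta):
    """alpha_j = exp(beta_j) / sum_j' exp(beta_j'), as in Equation (B.9)."""
    m = max(beta)
    e = [math.exp(x - m) for x in beta]
    z = sum(e)
    return [v / z for v in e]


def grad_beta(c, beta):
    """Closed form: sum_i c_il - alpha_l * sum_ij c_ij, for each l."""
    alpha = softmax(beta)
    total = sum(sum(row) for row in c)
    return [sum(row[l] for row in c) - alpha[l] * total
            for l in range(len(beta))]


def grad_beta_direct(c, beta):
    """Same quantity, summing c_ij * (I(j = l) - alpha_l) term by term."""
    alpha = softmax(beta)
    return [sum(c_ij * ((1.0 if j == l else 0.0) - alpha[l])
                for row in c for j, c_ij in enumerate(row))
            for l in range(len(beta))]
```

Because the $\alpha_l$ sum to one, the components of this gradient always sum to zero, which reflects the fact that adding a constant to every $\beta_j$ leaves the mixing weights unchanged.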


B.1.3 Derivative for Gaussian distribution

Consider the special case that $p(y_i|\theta_l)$ is a Gaussian distribution. Based on Equation (5.6), we can see that the natural parameters are $\Upsilon_l$ and $\nu_l$, the sufficient statistics consist of $y_i$ and $-\frac{1}{2} y_i y_i^T$, and the log-cumulant function $A(\theta_l)$ is given by Equation (5.7). In this case, we have

$$
\frac{\partial}{\partial \nu_l} \mathcal{J} = \sum_{i} c_{il} \, y_i - \mu_l \sum_{i} c_{il},
\tag{B.11}
$$

$$
\frac{\partial}{\partial \Upsilon_l} \mathcal{J} = -\frac{1}{2} \sum_{i} c_{il} \, y_i y_i^T + \left( \frac{1}{2} \mu_l \mu_l^T + \frac{1}{2} \Sigma_l \right) \sum_{i} c_{il}.
\tag{B.12}
$$

Note that the above computation implicitly assumes that $\Upsilon_l$ is symmetric. To explicitly enforce the constraints that $\Upsilon_l$ is symmetric and positive definite, we can re-parameterize by its Cholesky decomposition:


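$$
\Upsilon_l = F_l F_l^T.
\tag{B.13}
$$

Note, however, with this set of parameters, the density is no longer in its natural form. The gradient with respect to $\nu_l$ remains unchanged, and it is not hard to show that

$$
\frac{\partial}{\partial F_l} \log q_{ij} = I(j = l) \left( -y_i y_i^T + \mu_l \mu_l^T + \Sigma_l \right) F_l,
\tag{B.14}
$$

$$
\frac{\partial}{\partial F_l} \mathcal{J} = -\sum_{i} c_{il} \, y_i y_i^T F_l + \left( \mu_l \mu_l^T + \Sigma_l \right) F_l \sum_{i} c_{il}.
\tag{B.15}
$$

Alternatively, the Gaussian distribution can be parameterized by the mean $\mu_j$ and the precision matrix $\Upsilon_j$ as in Equation (5.5). Because

$$
\frac{\partial}{\partial \mu_l} \log q_{ij} = I(j = l) \, \Upsilon_l (y_i - \mu_l),
\tag{B.16}
$$

$$
\frac{\partial}{\partial \Upsilon_l} \log q_{ij} = I(j = l) \left( \frac{1}{2} \Sigma_l - \frac{1}{2} (y_i - \mu_l)(y_i - \mu_l)^T \right),
\tag{B.17}
$$

$$
\frac{\partial}{\partial F_l} \log q_{ij} = I(j = l) \left( \Sigma_l - (y_i - \mu_l)(y_i - \mu_l)^T \right) F_l,
\tag{B.18}
$$

the corresponding gradient of $\mathcal{J}$ is

$$
\frac{\partial}{\partial \mu_l} \mathcal{J} = \Upsilon_l \sum_{i} c_{il} (y_i - \mu_l),
\tag{B.19}
$$

$$
\frac{\partial}{\partial \Upsilon_l} \mathcal{J} = \frac{1}{2} \Sigma_l \sum_{i} c_{il} - \frac{1}{2} \sum_{i} c_{il} (y_i - \mu_l)(y_i - \mu_l)^T.
\tag{B.20}
$$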
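
The scalar case gives a quick way to check the derivatives with respect to the mean and the precision. The Python sketch below uses a one-dimensional Gaussian with precision `prec` and variance `1 / prec`, and compares the analytic expressions with central finite differences. It only illustrates the scalar counterparts of Equations (B.16) and (B.17); the function names are hypothetical.

```python
import math


def log_gaussian(y, mu, prec):
    """log N(y; mu, 1 / prec) for a scalar observation y."""
    return (0.5 * math.log(prec) - 0.5 * math.log(2 * math.pi)
            - 0.5 * prec * (y - mu) ** 2)


def analytic_derivatives(y, mu, prec):
    """Derivatives of log_gaussian with respect to mu and prec."""
    var = 1.0 / prec
    d_mu = prec * (y - mu)                        # scalar form of (B.16)
    d_prec = 0.5 * var - 0.5 * (y - mu) ** 2      # scalar form of (B.17)
    return d_mu, d_prec


def numeric_derivatives(y, mu, prec, eps=1e-6):
    """Central finite differences of log_gaussian."""
    d_mu = (log_gaussian(y, mu + eps, prec)
            - log_gaussian(y, mu - eps, prec)) / (2 * eps)
    d_prec = (log_gaussian(y, mu, prec + eps)
              - log_gaussian(y, mu, prec - eps)) / (2 * eps)
    return d_mu, d_prec
```

Both derivatives vanish on average when the mean and the variance match the data, which is the moment-matching condition noted for the exponential family.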


B.2 Second Order Information

The second-order information (Hessian) of the proposed objective function $\mathcal{J}$ can be derived in a manner similar to the first order information. We shall first compute the second-order differentials and then convert them to the Hessian matrix. Let $d^2 x$ denote the second-order differential of the variable $x$.


B.2.1 Second-order Differential

By taking the differential on both sides of Equation (B.4), we have

$$
d^2 \mathcal{L}_{\text{annealed}}(\theta;\mathcal{Y},\gamma) = \sum_{ij} (d\,\tilde{r}_{ij})(d \log q_{ij}) + \sum_{ij} \tilde{r}_{ij} \, (d^2 \log q_{ij}).
\tag{B.22}
$$

To compute $d\,\tilde{r}_{ij}$, we take the differentials of the logarithm of both sides of Equation (B.3):


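$$
\begin{aligned}
d \log \tilde{r}_{ij} &= d \log q_{ij}^{\gamma} - d \log \sum_{l=1}^{k} q_{il}^{\gamma} \\
&= \gamma \, d \log q_{ij} - \frac{1}{\sum_{l'=1}^{k} q_{il'}^{\gamma}} \sum_{l=1}^{k} \gamma \, q_{il}^{\gamma} \, d \log q_{il} \\
&= \gamma \left( d \log q_{ij} - \sum_{l} \tilde{r}_{il} \, d \log q_{il} \right).
\end{aligned}
\tag{B.23}
$$

Because of the identity that $d\,x = x \, d \log x$ for $x > 0$, we have

$$
d\,\tilde{r}_{ij} = \tilde{r}_{ij} \, d \log \tilde{r}_{ij}.
\tag{B.24}
$$

Substituting Equation (B.24) into Equation (B.22), we have

$$
\begin{aligned}
d^2 \mathcal{L}_{\text{annealed}}(\theta;\mathcal{Y},\gamma) &= \sum_{ij} \tilde{r}_{ij} \, (d^2 \log q_{ij}) + \gamma \sum_{ij} \tilde{r}_{ij} (d \log q_{ij})(d \log q_{ij}) \\
&\quad - \gamma \sum_{i} \sum_{l} \tilde{r}_{il} \, d \log q_{il} \sum_{j} \tilde{r}_{ij} \, d \log q_{ij} \\
&= \sum_{ij} \tilde{r}_{ij} \, (d^2 \log q_{ij}) + \gamma \sum_{ijl} \tilde{r}_{ij} (\delta_{jl} - \tilde{r}_{il})(d \log q_{ij})(d \log q_{il}).
\end{aligned}
\tag{B.25}
$$

Here, $\delta_{jl}$ is the delta function, and it is one if $j = l$ and zero otherwise. The definition of $s_{ij}$ in Equation (5.12) implies the following:

$$
d\,s_{ij} = s_{ij} \, d \log s_{ij},
\tag{B.26}
$$

$$
d \log s_{ij} = \tau \left( d \log q_{ij} - \sum_{l=1}^{k} s_{il} \, d \log q_{il} \right).
\tag{B.27}
$$

Note the similarity between the definitions of $s_{ij}$ and $\tilde{r}_{ij}$. Because for any $i$, $\sum_j \left( w_{ij} - s_{ij} \sum_l w_{il} \right) = 0$ and $\sum_j d\,s_{ij} = d \sum_j s_{ij} = 0$, Equation (B.5) can be rewritten as

$$
\begin{aligned}
d\,\mathcal{F}(\theta;\mathcal{C}) &= -\tau \sum_{i} \sum_{j} \left( w_{ij} - s_{ij} \sum_{l} w_{il} \right) (d \log q_{ij}) \\
&= -\tau \sum_{i} \sum_{j} \left( w_{ij} - s_{ij} \sum_{l} w_{il} \right) \left( d \log q_{ij} - \sum_{l} s_{il} \, d \log q_{il} \right) \\
&= -\sum_{i} \sum_{j} \left( w_{ij} - s_{ij} \sum_{l} w_{il} \right) d \log s_{ij} \\
&= -\sum_{i} \sum_{j} w_{ij} \, d \log s_{ij} + \sum_{i} \sum_{j} d\,s_{ij} \sum_{l} w_{il} = -\sum_{i} \sum_{j} w_{ij} \, d \log s_{ij}.
\end{aligned}
$$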

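The differential of this expression yields $d^2 \mathcal{F}(\theta;\mathcal{C})$. So, we need to find $d\,w_{ij}$ and $d^2 \log s_{ij}$. The definition of $w_{ij}$ in Equation (B.6) means that its differential is

$$
\begin{aligned}
d\,w_{ij} &= \left( \sum_{h=1}^{m^{+}} \lambda^{+}_{h} a_{hi} - \sum_{h=1}^{m^{-}} \lambda^{-}_{h} b_{hi} \right) (\log s_{ij} + 1) \, s_{ij} \, d \log s_{ij} \\
&\quad - s_{ij} \left( \sum_{h=1}^{m^{+}} \frac{\lambda^{+}_{h} a_{hi}}{t^{+}_{hj}} \, d\,t^{+}_{hj} - \sum_{h=1}^{m^{-}} \frac{\lambda^{-}_{h} b_{hi}}{t^{-}_{hj}} \, d\,t^{-}_{hj} \right) \\
&\quad - \left( \sum_{h=1}^{m^{+}} \lambda^{+}_{h} a_{hi} \log t^{+}_{hj} - \sum_{h=1}^{m^{-}} \lambda^{-}_{h} b_{hi} \log t^{-}_{hj} \right) s_{ij} \, d \log s_{ij} \\
&= \tilde{w}_{ij} \, d \log s_{ij} - \sum_{h=1}^{m^{+}} \frac{\lambda^{+}_{h}}{t^{+}_{hj}} a_{hi} s_{ij} \, (d\,t^{+}_{hj}) + \sum_{h=1}^{m^{-}} \frac{\lambda^{-}_{h}}{t^{-}_{hj}} b_{hi} s_{ij} \, (d\,t^{-}_{hj}),
\end{aligned}
$$

where we define $\tilde{w}_{ij} = w_{ij} + \left( \sum_{h=1}^{m^{+}} \lambda^{+}_{h} a_{hi} - \sum_{h=1}^{m^{-}} \lambda^{-}_{h} b_{hi} \right) s_{ij}$. Taking the differentials of both sides of Equation (B.27), we have

$$
\begin{aligned}
d^2 \log s_{ij} &= \tau \left( d^2 \log q_{ij} - \sum_{l=1}^{k} s_{il} \, d^2 \log q_{il} - \sum_{l=1}^{k} s_{il} (d \log s_{il})(d \log q_{il}) \right) \\
&= \tau \left( d^2 \log q_{ij} - \sum_{l=1}^{k} s_{il} \, d^2 \log q_{il} \right) - \sum_{l=1}^{k} s_{il} (d \log s_{il}) \left( d \log s_{il} + \tau \sum_{l'=1}^{k} s_{il'} \, d \log q_{il'} \right) \\
&= \tau \left( d^2 \log q_{ij} - \sum_{l=1}^{k} s_{il} \, d^2 \log q_{il} \right) - \sum_{l=1}^{k} s_{il} (d \log s_{il})(d \log s_{il}).
\end{aligned}
$$

Note that we have used the fact $\sum_{l=1}^{k} s_{il} (d \log s_{il}) = 0$. If we define $\hat{w}_{ij} = w_{ij} - s_{ij} \sum_{l=1}^{k} w_{il}$, we can write

$$
\sum_{ij} w_{ij} \, d^2 \log s_{ij} = \tau \sum_{ij} \hat{w}_{ij} \, d^2 \log q_{ij} - \sum_{i} \sum_{j=1}^{k} w_{ij} \sum_{l=1}^{k} s_{il} (d \log s_{il})(d \log s_{il}).
\tag{B.28}
$$

Putting them together, we have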


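$$
\begin{aligned}
-d^2 \mathcal{F}(\theta;\mathcal{C}) &= \sum_{ij} w_{ij} \, d^2 \log s_{ij} + \sum_{ij} d\,w_{ij} \, d \log s_{ij} \\
&= \sum_{ij} w_{ij} \, d^2 \log s_{ij} + \sum_{ij} \tilde{w}_{ij} (d \log s_{ij})(d \log s_{ij}) \\
&\quad - \sum_{j} \sum_{h=1}^{m^{+}} \sum_{i} \frac{\lambda^{+}_{h}}{t^{+}_{hj}} (d\,t^{+}_{hj})(a_{hi} s_{ij} \, d \log s_{ij}) \\
&\quad + \sum_{j} \sum_{h=1}^{m^{-}} \sum_{i} \frac{\lambda^{-}_{h}}{t^{-}_{hj}} (d\,t^{-}_{hj})(b_{hi} s_{ij} \, d \log s_{ij}) \\
&= \tau \sum_{ij} \left( w_{ij} - s_{ij} \sum_{l=1}^{k} w_{il} \right) d^2 \log q_{ij} + \sum_{ij} \left( \tilde{w}_{ij} - s_{ij} \sum_{l=1}^{k} w_{il} \right) (d \log s_{ij})(d \log s_{ij}) \\
&\quad - \sum_{h=1}^{m^{+}} \sum_{j} \frac{\lambda^{+}_{h}}{t^{+}_{hj}} (d\,t^{+}_{hj})(d\,t^{+}_{hj}) + \sum_{h=1}^{m^{-}} \sum_{j} \frac{\lambda^{-}_{h}}{t^{-}_{hj}} (d\,t^{-}_{hj})(d\,t^{-}_{hj}).
\end{aligned}
\tag{B.29}
$$

Note that $\tilde{w}_{ij} - s_{ij} \sum_{l=1}^{k} w_{il} = w_{ij} - s_{ij} \sum_{l=1}^{k} w_{il} + \left( \sum_{h=1}^{m^{+}} \lambda^{+}_{h} a_{hi} - \sum_{h=1}^{m^{-}} \lambda^{-}_{h} b_{hi} \right) s_{ij}$.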


B.2.2 Obtaining the Hessian matrix

Our goal in this section is to obtain the Hessian matrix of $\mathcal{J}$ with respect to the parameters ordered by $\beta_1, \ldots, \beta_k, \theta_1, \ldots, \theta_k$. Let $\psi_{iu}$ denote the column vector $\frac{\partial \log q_{iu}}{\partial \theta_u}$, and define $\Psi_u$ to be the $|\theta_u|$ by $n$ matrix $[\psi_{1u}, \ldots, \psi_{nu}]$, where $|\theta_u|$ is the number of parameters in $\theta_u$. Let $D_{uv}$ be an $n$ by $n$ diagonal matrix such that its $(i,i)$-th entry is $\gamma (\delta_{uv} - \tilde{r}_{iv}) \tilde{r}_{iu}$. Let $1_{1,n}$ denote a 1 by $n$ matrix with all its entries equal to one. Let $H_{\beta}$ be the Hessian matrix of $(\log q_{ij})$ with respect to the $\beta_u$s, i.e., the $(u,v)$-th entry of $H_{\beta}$ is given by

$$
\frac{\partial^2}{\partial \beta_u \partial \beta_v} \log q_{ij} = \frac{\partial^2}{\partial \beta_u \partial \beta_v} \log \alpha_j = \frac{\partial}{\partial \beta_v} (\delta_{ju} - \alpha_u) = -\alpha_u (\delta_{uv} - \alpha_v).
$$

Here, $\delta_{uv}$ is the Kronecker delta, which is 1 if $u = v$ and 0 otherwise. Note that $H_{\beta}$ does not depend on the value of $i$ and $j$ in $\log q_{ij}$. Let $H_{ij}$ denote the Hessian of $\log p(y_i|\theta_j)$ with respect to $\theta_j$. Its exact form for the case of Gaussian distributions will be derived in the next section.

B.2.2.1 Hessian of the Log-likelihood


Based  on  Equation  (B.25),  we  have 


9  £annealed(0;3;i7) 


ddudOv 


=  £ 


o 

<9  log  q, 


*3 


Ijl 


(d^gqlJ\(d\ogqll\T 
iV'H  \  fU)..  \  nn..  i 


"'wvy  dou  J  V  ; 


o 

r  ~  ^  log  Qjqi  Y^/c  ~  ,  , 

=  ^  Z.  riu~^2 —  +  ^  riv)riu^iu^ 


T 

iv 


-  Suv  ^2  rmHra  + 


j2 

dpudev 


—cannealed(e;y,7) 


=  7  £%  ^  fU^ij  (Suj  ~  a«)  ^  (^f^)  =  7  E^' 

ijl  i 


uv  ^iv^iu^iv 


—  *-1  .n^uv'&'v 


dl 


.^annealed 


d(3udl3v 

=  ^ 2rijau(av  ~  Suv)  +  7 £(^j7  —  ?il)?ij(duj  ~  au){\,i  ~  oiy) 
ij  ijl 

Er  j  r 

(Zwu  —  ^iv)^iu  =  nau{av  ~  3'uv)  +  r 


n 

Define  H  ,  the  “expected  Hessian”  of  the  annealed  log- likelihood,  by 
H£  =  blk-diag(nHg,  J2filUib  ■  ■  • » 


(B.30) 


It  can  be  viewed  as  the  expected  value  of  the  Hessian  matrix  of  the  complete-data  log-likelihood. 
Define  a  k(  1  +  |0]J)  by  matrix  A  and  partition  it  into  2k  by  k  blocks,  so  that  the  (i,j)-th  block 
is  lq  n,  and  the  ( j  +  k,j)- th  block  is  \l f  j,  where  1  <  i  <  k.  All  other  entries  in  A  are  zero.  In  other 
words, 


■*■1  ,n 

■*■1,71 

1 1  ,n 

*■1,71 

*1 

[-0H... -0nl] 

1 - 

■e 

i _ 

Wlk  ■  •  ■  ^nkl 

(B.31) 


Let $D$ be an nk by nk matrix, partitioned into k by k blocks so that the (u, v)-th block is $D_{uv}$. With this notation, the Hessian of the annealed log-likelihood is

$$\frac{\partial^2}{\partial\theta^2}\mathcal{L}_{annealed}(\theta;\mathcal{Y},\gamma) = H^{\mathcal{L}} + ADA^T. \qquad (B.32)$$
$D$ is symmetric because the (i, i)-th elements of both $D_{uv}$ and $D_{vu}$ are $\gamma(\delta_{uv}-r_{iu})\,r_{iv}$. Also, the sum of each column of $D$ is 0 because $\sum_v \gamma(\delta_{uv}-r_{iu})\,r_{iv} = 0$.
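Both properties of $D$ are easy to confirm on a small example. The following is a minimal sketch, assuming only that each row of the responsibilities $r_{ij}$ sums to one; the random responsibilities, the value of $\gamma$, and the helper names are hypothetical and chosen for illustration.

```python
import random

def random_responsibilities(n, k, seed=0):
    # Hypothetical test data: n rows of positive weights that sum to one.
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        w = [rng.random() + 0.1 for _ in range(k)]
        total = sum(w)
        rows.append([x / total for x in w])
    return rows

def build_D(r, gamma):
    # D has k-by-k blocks; block (u, v) is the n-by-n diagonal matrix whose
    # (i, i)-th element is gamma * (delta_uv - r_iu) * r_iv.
    n, k = len(r), len(r[0])
    D = [[0.0] * (n * k) for _ in range(n * k)]
    for u in range(k):
        for v in range(k):
            delta = 1.0 if u == v else 0.0
            for i in range(n):
                D[u * n + i][v * n + i] = gamma * (delta - r[i][u]) * r[i][v]
    return D

def max_asymmetry(M):
    size = len(M)
    return max(abs(M[a][b] - M[b][a]) for a in range(size) for b in range(size))

def max_abs_column_sum(M):
    size = len(M)
    return max(abs(sum(M[a][b] for a in range(size))) for b in range(size))
```

For instance, `build_D(random_responsibilities(5, 3), 0.7)` gives a 15 by 15 matrix for which both `max_asymmetry` and `max_abs_column_sum` are zero up to rounding error.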


B.2.2.2  Hessian  of  the  Constraint  Violation  Term 


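By converting the differentials in Equation (B.27) into derivatives, we have

$$\frac{\partial\log s_{ij}}{\partial\theta_u} = \tau\Big(\delta_{ju}\psi_{iu} - \sum_{l=1}^{k} s_{il}\,\delta_{lu}\,\psi_{iu}\Big) = \tau(\delta_{ju}-s_{iu})\,\psi_{iu},$$

$$\frac{\partial\log s_{ij}}{\partial\beta_u} = \tau\Big(\delta_{ju}-\alpha_u - \sum_{l=1}^{k} s_{il}(\delta_{lu}-\alpha_u)\Big) = \tau(\delta_{ju}-s_{iu}).$$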
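The derivative $\tau(\delta_{ju}-s_{iu})$ with respect to $\beta_u$ can be verified by finite differences. This is a sketch under two assumptions that are not spelled out in this section: $s_{ij}$ is proportional to $q_{ij}^{\tau}$ with $q_{ij}=\alpha_j\,p(y_i|\theta_j)$, and $\alpha$ is the softmax of $\beta$. The density values in `p_row` are arbitrary placeholders, and all function names are illustrative.

```python
import math

def softmax(beta):
    m = max(beta)
    e = [math.exp(b - m) for b in beta]
    total = sum(e)
    return [x / total for x in e]

def log_s(beta, p_row, tau, j):
    # log s_ij = tau * log q_ij - log sum_l q_il^tau, with q_il = alpha_l * p_il.
    alpha = softmax(beta)
    log_q = [math.log(a) + math.log(p) for a, p in zip(alpha, p_row)]
    m = max(tau * x for x in log_q)
    log_norm = m + math.log(sum(math.exp(tau * x - m) for x in log_q))
    return tau * log_q[j] - log_norm

def grad_fd(beta, p_row, tau, j, u, h=1e-6):
    # Central finite difference of log s_ij with respect to beta_u.
    plus, minus = list(beta), list(beta)
    plus[u] += h
    minus[u] -= h
    return (log_s(plus, p_row, tau, j) - log_s(minus, p_row, tau, j)) / (2.0 * h)

def grad_closed(beta, p_row, tau, j, u):
    # Closed form: tau * (delta_ju - s_iu).
    s_u = math.exp(log_s(beta, p_row, tau, u))
    delta = 1.0 if j == u else 0.0
    return tau * (delta - s_u)

def max_abs_error(beta, p_row, tau):
    k = len(beta)
    return max(abs(grad_fd(beta, p_row, tau, j, u) - grad_closed(beta, p_row, tau, j, u))
               for j in range(k) for u in range(k))
```

The $\alpha_u$ terms cancel in the closed form because the $s_{il}$ sum to one over l, which is why only $s_{iu}$ remains.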


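This implies

$$\sum_{ij}\Big(w_{ij}-s_{ij}\sum_{l=1}^{k}w_{il}\Big)\left(\frac{\partial\log s_{ij}}{\partial\beta_u}\right)\left(\frac{\partial\log s_{ij}}{\partial\beta_v}\right)^T = \tau^2\sum_i\Big(\sum_j\Big(w_{ij}-s_{ij}\sum_{l=1}^{k}w_{il}\Big)(\delta_{ju}-s_{iu})(\delta_{jv}-s_{iv})\Big) = 1_{1,n}E_{uv}1_{1,n}^T.$$

Similarly, we have

$$\sum_{ij}\Big(w_{ij}-s_{ij}\sum_{l=1}^{k}w_{il}\Big)\left(\frac{\partial\log s_{ij}}{\partial\theta_u}\right)\left(\frac{\partial\log s_{ij}}{\partial\theta_v}\right)^T = \Psi_u E_{uv}\Psi_v^T,$$

$$\sum_{ij}\Big(w_{ij}-s_{ij}\sum_{l=1}^{k}w_{il}\Big)\left(\frac{\partial\log s_{ij}}{\partial\beta_u}\right)\left(\frac{\partial\log s_{ij}}{\partial\theta_v}\right)^T = 1_{1,n}E_{uv}\Psi_v^T.$$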

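Here, $\tau^2\sum_j\big(w_{ij}-s_{ij}\sum_{l=1}^{k}w_{il}\big)(\delta_{ju}-s_{iu})(\delta_{jv}-s_{iv})$ is the (i, i)-th element of the n by n diagonal matrix $E_{uv}$. Let $a_{hju}$ denote a vector of length n such that its i-th entry is given by $\tau a_{hi}s_{ij}(\delta_{ju}-s_{iu})$. Because $\mathrm{d}\,t^+_{hj} = \sum_i a_{hi}\,\mathrm{d}\,s_{ij} = \sum_i a_{hi}s_{ij}\,\mathrm{d}\log s_{ij}$, we have

$$\frac{\partial t^+_{hj}}{\partial\theta_u} = \sum_i a_{hi}s_{ij}\frac{\partial\log s_{ij}}{\partial\theta_u} = \tau\sum_i a_{hi}s_{ij}(\delta_{ju}-s_{iu})\,\psi_{iu} = \Psi_u a_{hju},$$

$$\frac{\partial t^+_{hj}}{\partial\beta_u} = \tau\sum_i a_{hi}s_{ij}(\delta_{ju}-s_{iu}) = 1_{1,n}a_{hju}.$$

This means that

$$\sum_{h=1}^{m^+}\sum_{j=1}^{k}\frac{\lambda^+_h}{t^+_{hj}}\left(\frac{\partial t^+_{hj}}{\partial\theta_u}\right)\left(\frac{\partial t^+_{hj}}{\partial\theta_v}\right)^T = \Psi_u\Big(\sum_{h=1}^{m^+}\sum_{j=1}^{k}\frac{\lambda^+_h}{t^+_{hj}}\,a_{hju}a_{hjv}^T\Big)\Psi_v^T = \Psi_u\mathcal{A}_uL^+\mathcal{A}_v^T\Psi_v^T,$$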
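where we concatenate different $a_{hju}$ to form an n by $km^+$ matrix $\mathcal{A}_u$, defined by

$$\mathcal{A}_u = [a_{1,1,u},\ a_{2,1,u},\ \ldots,\ a_{m^+,1,u},\ a_{1,2,u},\ \ldots,\ a_{m^+,2,u},\ \ldots,\ a_{1,k,u},\ \ldots,\ a_{m^+,k,u}].$$

Note that $\mathcal{A}_u$ has a similar sparsity pattern to the matrix $\{a_{hi}\}$. The diagonal matrix $L^+$ is of size $km^+$ by $km^+$. Its diagonal entries are given by $\lambda^+_h/t^+_{hj}$, and the ordering of these diagonal entries matches the ordering of $a_{hju}$ in $\mathcal{A}_u$. By similar reasoning, we have

$$\sum_{h=1}^{m^+}\sum_{j=1}^{k}\frac{\lambda^+_h}{t^+_{hj}}\left(\frac{\partial t^+_{hj}}{\partial\beta_u}\right)\left(\frac{\partial t^+_{hj}}{\partial\theta_v}\right)^T = 1_{1,n}\mathcal{A}_uL^+\mathcal{A}_v^T\Psi_v^T,$$

$$\sum_{h=1}^{m^+}\sum_{j=1}^{k}\frac{\lambda^+_h}{t^+_{hj}}\left(\frac{\partial t^+_{hj}}{\partial\beta_u}\right)\left(\frac{\partial t^+_{hj}}{\partial\beta_v}\right) = 1_{1,n}\mathcal{A}_uL^+\mathcal{A}_v^T1_{1,n}^T.$$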

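The case for $t^-_{hj}$, which corresponds to the must-not-link constraints, is similar. So, we define $b_{hju}$ to consist of $\tau b_{hi}s_{ij}(\delta_{ju}-s_{iu})$ for different i, and concatenate $b_{hju}$ to form $\mathcal{B}_u$. $L^-$ is a diagonal matrix with entries $\lambda^-_h/t^-_{hj}$. Substituting all these into the result derived in Equation (B.29), we have

$$\frac{\partial}{\partial\theta_u}\sum_{ij}w_{ij}\frac{\partial}{\partial\theta_v}\log s_{ij} = \tau\sum_{ij}w_{ij}\frac{\partial^2}{\partial\theta_u\,\partial\theta_v}\log q_{ij} + \sum_{ij}\Big(w_{ij}-s_{ij}\sum_{l=1}^{k}w_{il}\Big)\left(\frac{\partial}{\partial\theta_u}\log s_{ij}\right)\left(\frac{\partial}{\partial\theta_v}\log s_{ij}\right)^T$$

$$\qquad - \sum_{h=1}^{m^+}\sum_j\frac{\lambda^+_h}{t^+_{hj}}\left(\frac{\partial t^+_{hj}}{\partial\theta_u}\right)\left(\frac{\partial t^+_{hj}}{\partial\theta_v}\right)^T + \sum_{h=1}^{m^-}\sum_j\frac{\lambda^-_h}{t^-_{hj}}\left(\frac{\partial t^-_{hj}}{\partial\theta_u}\right)\left(\frac{\partial t^-_{hj}}{\partial\theta_v}\right)^T$$

$$= \tau\,\delta_{uv}\sum_i w_{iu}\frac{\partial^2}{\partial\theta_u^2}\log q_{iu} + \Psi_uE_{uv}\Psi_v^T - \Psi_u\mathcal{A}_uL^+\mathcal{A}_v^T\Psi_v^T + \Psi_u\mathcal{B}_uL^-\mathcal{B}_v^T\Psi_v^T$$

$$= \tau\,\delta_{uv}\sum_i w_{iu}H_{iu} + \Psi_u\big(E_{uv}-\mathcal{A}_uL^+\mathcal{A}_v^T+\mathcal{B}_uL^-\mathcal{B}_v^T\big)\Psi_v^T.$$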


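Similarly, we have

$$\frac{\partial}{\partial\beta_u}\sum_{ij}w_{ij}\frac{\partial}{\partial\theta_v}\log s_{ij} = 1_{1,n}\big(E_{uv}-\mathcal{A}_uL^+\mathcal{A}_v^T+\mathcal{B}_uL^-\mathcal{B}_v^T\big)\Psi_v^T,$$

$$\frac{\partial}{\partial\beta_u}\sum_{ij}w_{ij}\frac{\partial}{\partial\beta_v}\log s_{ij} = \tau\sum_{ij}w_{ij}\frac{\partial^2\log q_{ij}}{\partial\beta_u\,\partial\beta_v} + 1_{1,n}\big(E_{uv}-\mathcal{A}_uL^+\mathcal{A}_v^T+\mathcal{B}_uL^-\mathcal{B}_v^T\big)1_{1,n}^T.$$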


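Let $H^{\mathcal{C}}$ denote the “expected” Hessian of the complete-data log-likelihood due to the constraints, i.e.,

$$H^{\mathcal{C}} = \text{blk-diag}\Big(0,\ \tau\sum_i w_{i1}H_{i1},\ \ldots,\ \tau\sum_i w_{ik}H_{ik}\Big). \qquad (B.33)$$

Note that there are no Hessian terms corresponding to the $\beta_j$ because $\sum_j w_{ij} = 0$. Let $E$ be an nk by nk matrix. We partition it into k by k blocks, such that the (u, v)-th block is $E_{uv}$. Let $\mathcal{A}$ be an nk by $km^+$ matrix and $\mathcal{B}$ be an nk by $km^-$ matrix, such that

$$\mathcal{A} = \begin{bmatrix}\mathcal{A}_1\\ \vdots\\ \mathcal{A}_k\end{bmatrix}, \qquad \mathcal{B} = \begin{bmatrix}\mathcal{B}_1\\ \vdots\\ \mathcal{B}_k\end{bmatrix}.$$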

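We are now ready to state the Hessian term corresponding to the constraints:

$$\frac{\partial^2}{\partial\theta^2}\Big(\mathcal{J} - \mathcal{L}_{annealed}(\theta;\mathcal{Y},\gamma)\Big) = -H^{\mathcal{C}} - AEA^T + A\mathcal{A}L^+\mathcal{A}^TA^T - A\mathcal{B}L^-\mathcal{B}^TA^T. \qquad (B.34)$$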


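Note that the sum of each of the columns of $\mathcal{A}$ is 0, because $\sum_{iu}\tau a_{hi}s_{ij}(\delta_{ju}-s_{iu}) = \sum_i\tau a_{hi}s_{ij}\sum_u(\delta_{ju}-s_{iu}) = 0$. Combining Equation (B.34) with Equation (B.32), we have the Hessian of the objective function $\mathcal{J}$ in matrix form:

$$\frac{\partial^2}{\partial\theta^2}\mathcal{J} = H^{\mathcal{L}} - H^{\mathcal{C}} + ADA^T - AEA^T + A\mathcal{A}L^+\mathcal{A}^TA^T - A\mathcal{B}L^-\mathcal{B}^TA^T$$

$$= H^{\mathcal{LC}} + A\big(D - E + \mathcal{A}L^+\mathcal{A}^T - \mathcal{B}L^-\mathcal{B}^T\big)A^T. \qquad (B.35)$$

Here, $H^{\mathcal{LC}} = H^{\mathcal{L}} - H^{\mathcal{C}}$ is the combined expected Hessian.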


B.2.3  Hessian  of  the  Gaussian  Probability  Density  Function 

Computation of the combined expected Hessian requires $H_{ij}$, which is the result of differentiating $\log p(y_i|\theta_j)$ with respect to the parameter $\theta_j$ twice. We shall derive the explicit form of $H_{ij}$ when $p(y_i|\theta_j)$ is the Gaussian pdf. For simplicity, we shall omit the reference to the object index i and the cluster index j in our derivation.

We shall need some notation from matrix calculus [179] in our derivation. Let $\mathrm{vec}\,X$ denote a vector of length pq formed by stacking the columns of a p by q matrix $X$. Let $Y$ be an r by s matrix. The Kronecker product $X \otimes Y$ is a pr by qs matrix defined by

$$X \otimes Y = \begin{bmatrix} x_{11}Y & x_{12}Y & \cdots & x_{1q}Y\\ x_{21}Y & x_{22}Y & \cdots & x_{2q}Y\\ \vdots & & & \vdots\\ x_{p1}Y & \cdots & \cdots & x_{pq}Y\end{bmatrix}. \qquad (B.36)$$

The precedence of the operator $\otimes$ is defined to be lower than matrix multiplication, i.e., $XY \otimes Z$ is the same as $(XY) \otimes Z$. The following identity is used frequently in this section:

$$\mathrm{vec}(XYZ) = (Z^T \otimes X)\,\mathrm{vec}\,Y. \qquad (B.37)$$
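Identity (B.37) can be confirmed on small integer matrices, where the comparison is exact. The sketch below uses plain nested lists; the helper names and the example matrices are illustrative choices, not part of the original text.

```python
def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def kron(A, B):
    # Kronecker product: block (i, j) of the result is A[i][j] * B.
    return [[a * b for a in row_a for b in row_b] for row_a in A for row_b in B]

def vec(A):
    # Stack the columns of A into one long vector.
    return [A[i][j] for j in range(len(A[0])) for i in range(len(A))]

def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def check_vec_identity(X, Y, Z):
    # vec(XYZ) should equal (Z^T kron X) vec(Y).
    lhs = vec(matmul(matmul(X, Y), Z))
    rhs = matvec(kron(transpose(Z), X), vec(Y))
    return lhs == rhs

X = [[1, 2, 0], [-1, 3, 4]]           # 2 by 3
Y = [[2, -1], [0, 5], [3, 1]]         # 3 by 2
Z = [[1, 0, 2, -3], [4, 1, -1, 2]]    # 2 by 4
```

With these shapes, $Z^T \otimes X$ is 8 by 6 and $\mathrm{vec}\,Y$ has length 6, so both sides are vectors of length 8.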

Let $K_d$ denote a permutation matrix of size $d^2$ by $d^2$, such that

$$K_d\,\mathrm{vec}\,Z = \mathrm{vec}\,Z^T, \qquad (B.38)$$

where $Z$ is a d by d matrix. Note that $K_d^T = K_d^{-1} = K_d$.
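The permutation matrix in Equation (B.38) can be built explicitly for small d. The following sketch constructs it under the column-stacking convention for vec used above and checks the stated properties; the function names are illustrative.

```python
def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def vec(A):
    # Stack the columns of A into one long vector.
    return [A[i][j] for j in range(len(A[0])) for i in range(len(A))]

def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def commutation_matrix(d):
    # Entry Z[i][j] sits at position j*d + i of vec(Z) and must move to
    # position i*d + j, where it appears in vec(Z^T).
    K = [[0] * (d * d) for _ in range(d * d)]
    for i in range(d):
        for j in range(d):
            K[j * d + i][i * d + j] = 1
    return K

def identity(size):
    return [[1 if a == b else 0 for b in range(size)] for a in range(size)]

def check_commutation(Z):
    d = len(Z)
    K = commutation_matrix(d)
    maps_vec = matvec(K, vec(Z)) == vec(transpose(Z))
    symmetric = K == transpose(K)
    involution = matmul(K, K) == identity(d * d)
    return maps_vec and symmetric and involution
```

Because $K_d$ is both symmetric and its own inverse, applying it twice returns the original vector, which corresponds to transposing $Z$ twice.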


B.2.3.1  Natural  Parameter 

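When the density is parameterized by its natural parameter as in Equation (5.6), we have

$$\frac{\partial}{\partial\nu}\log p(y) = y - \frac{1}{2}\big(\Upsilon^{-1}+\Upsilon^{-T}\big)\nu$$

$$\frac{\partial}{\partial\Upsilon}\log p(y) = -\frac{1}{2}yy^T + \frac{1}{2}\Upsilon^{-T} + \frac{1}{2}\Upsilon^{-T}\nu\nu^T\Upsilon^{-T}$$

$$\frac{\partial}{\partial F}\log p(y) = -yy^TF + F^{-T} + \Upsilon^{-T}\nu\nu^T\Upsilon^{-T}F,$$

where $\Upsilon^{-T}$ denotes the transpose of $\Upsilon^{-1}$. Therefore,

$$\frac{\partial^2}{\partial\nu^2}\log p(y) = -\frac{1}{2}\big(\Upsilon^{-1}+\Upsilon^{-T}\big)$$

$$\frac{\partial^2}{\partial\,\mathrm{vec}\,\Upsilon\ \partial\nu}\log p(y) = \frac{1}{2}\big(\Upsilon^{-1}\otimes\Upsilon^{-T}\nu + \Upsilon^{-1}\nu\otimes\Upsilon^{-T}\big) = \frac{1}{2}\big(\Upsilon^{-1}\otimes\Upsilon^{-T}\big)\big(I_d\otimes\nu + \nu\otimes I_d\big)$$

$$\frac{\partial^2}{\partial\,\mathrm{vec}\,F\ \partial\nu}\log p(y) = F^T\Upsilon^{-1}\otimes\Upsilon^{-T}\nu + F^T\Upsilon^{-1}\nu\otimes\Upsilon^{-T}$$

The last term in the Hessian matrix requires more work. We first take the differential with respect to $\Upsilon$:

$$\mathrm{d}\,\frac{\partial}{\partial\Upsilon}\log p(y) = -\frac{1}{2}\Upsilon^{-T}(\mathrm{d}\,\Upsilon^T)\Upsilon^{-T} - \frac{1}{2}\Upsilon^{-T}\nu\nu^T\Upsilon^{-T}(\mathrm{d}\,\Upsilon^T)\Upsilon^{-T} - \frac{1}{2}\Upsilon^{-T}(\mathrm{d}\,\Upsilon^T)\Upsilon^{-T}\nu\nu^T\Upsilon^{-T}.$$

By using the identity in Equation (B.37), the Hessian term can be obtained as

$$\frac{\partial^2}{\partial(\mathrm{vec}\,\Upsilon)^2}\log p(y) = -\frac{1}{2}\big(\Upsilon^{-1}\otimes\Upsilon^{-T}\big)K_d - \frac{1}{2}\big(\Upsilon^{-1}\otimes\Upsilon^{-T}\nu\nu^T\Upsilon^{-T}\big)K_d - \frac{1}{2}\big(\Upsilon^{-1}\nu\nu^T\Upsilon^{-1}\otimes\Upsilon^{-T}\big)K_d$$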
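Similarly, the Hessian term corresponding to $F$ can be obtained if we note that

$$\mathrm{d}\,\frac{\partial}{\partial F}\log p(y) = -yy^T(\mathrm{d}\,F) - F^{-T}(\mathrm{d}\,F^T)F^{-T} - \Upsilon^{-T}\nu\nu^TF^{-T}(\mathrm{d}\,F^T)F^{-T} - \Upsilon^{-T}\big(F(\mathrm{d}\,F^T) + (\mathrm{d}\,F)F^T\big)\Upsilon^{-T}\nu\nu^TF^{-T}$$

$$\frac{\partial^2}{\partial(\mathrm{vec}\,F)^2}\log p(y) = -I_d\otimes yy^T - \big(F^{-1}\otimes F^{-T}\big)K_d - \big(F^{-1}\otimes\mu\mu^TF\big)K_d - \big(F^T\mu\mu^T\otimes F^{-T}\big)K_d - F^T\mu\mu^TF\otimes\Sigma$$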

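In the special case that $\Upsilon$ is always symmetric, we can have a simpler Hessian term. This amounts to assuming that $\Upsilon^T = \Upsilon$ and $(\mathrm{d}\,\Upsilon)^T = (\mathrm{d}\,\Upsilon)$. We have

$$\frac{\partial^2}{\partial(\mathrm{vec}\,\Upsilon)^2}\log p(y) = -\frac{1}{2}\Sigma\otimes\Sigma - \frac{1}{2}\Sigma\otimes\mu\mu^T - \frac{1}{2}\mu\mu^T\otimes\Sigma$$

$$= -\frac{1}{2}\Big(\big(\Sigma+\mu\mu^T\big)\otimes\big(\Sigma+\mu\mu^T\big) - (\mu\otimes\mu)(\mu^T\otimes\mu^T)\Big)$$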
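The last rewriting step relies on a purely algebraic Kronecker identity: expanding $(\Sigma+\mu\mu^T)\otimes(\Sigma+\mu\mu^T)$ and removing $(\mu\otimes\mu)(\mu^T\otimes\mu^T)$ leaves the three terms $\Sigma\otimes\Sigma$, $\Sigma\otimes\mu\mu^T$ and $\mu\mu^T\otimes\Sigma$. A minimal sketch that checks this with integer entries, so the comparison is exact, is given below; the example values of `Sigma` and `mu` are arbitrary and the helper names are illustrative.

```python
def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def kron(A, B):
    # Kronecker product: block (i, j) of the result is A[i][j] * B.
    return [[a * b for a in row_a for b in row_b] for row_a in A for row_b in B]

def add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def sub(A, B):
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def check_symmetric_case(Sigma, mu):
    mu_col = [[m] for m in mu]      # d by 1
    mu_row = [list(mu)]             # 1 by d
    outer = matmul(mu_col, mu_row)  # mu mu^T
    shifted = add(Sigma, outer)     # Sigma + mu mu^T
    # (mu kron mu)(mu^T kron mu^T), a d^2 by d^2 matrix.
    correction = matmul(kron(mu_col, mu_col), kron(mu_row, mu_row))
    compact = sub(kron(shifted, shifted), correction)
    expanded = add(add(kron(Sigma, Sigma), kron(Sigma, outer)), kron(outer, Sigma))
    return compact == expanded

Sigma = [[2, 1], [1, 3]]
mu = [1, -2]
```

The check works because $(\mu\otimes\mu)(\mu^T\otimes\mu^T) = \mu\mu^T\otimes\mu\mu^T$, which is exactly the fourth term produced by the expansion.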

B.2.3.2  Moment  Parameter 

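When the moment parameter is used as in Equation (5.6) for the density, we have

$$\frac{\partial}{\partial\mu}\log p(y) = \frac{1}{2}\big(\Upsilon+\Upsilon^T\big)(y-\mu)$$

$$\frac{\partial}{\partial\Upsilon}\log p(y) = -\frac{1}{2}(y-\mu)(y-\mu)^T + \frac{1}{2}\Upsilon^{-T}$$

$$\frac{\partial}{\partial F}\log p(y) = -(y-\mu)(y-\mu)^TF + F^{-T}$$

The second-order terms include

$$\frac{\partial^2}{\partial\mu^2}\log p(y) = -\frac{1}{2}\big(\Upsilon+\Upsilon^T\big)$$

$$\frac{\partial^2}{\partial\,\mathrm{vec}\,\Upsilon\ \partial\mu}\log p(y) = \frac{1}{2}\big(I_d\otimes(y-\mu) + (y-\mu)\otimes I_d\big) = \frac{1}{2}\big(I_{d^2}+K_d\big)\big(I_d\otimes(y-\mu)\big)$$

$$\frac{\partial^2}{\partial\,\mathrm{vec}\,F\ \partial\mu}\log p(y) = F^T(y-\mu)\otimes I_d + F^T\otimes(y-\mu) = \big(F^T\otimes I_d\big)\big(I_{d^2}+K_d\big)\big(I_d\otimes(y-\mu)\big)$$

As in the case of the natural parameter, we have

$$\mathrm{d}\,\frac{\partial}{\partial\Upsilon}\log p(y) = -\frac{1}{2}\Upsilon^{-T}(\mathrm{d}\,\Upsilon^T)\Upsilon^{-T}$$

$$\frac{\partial^2}{\partial(\mathrm{vec}\,\Upsilon)^2}\log p(y) = -\frac{1}{2}\big(\Upsilon^{-1}\otimes\Upsilon^{-T}\big)K_d$$

$$\mathrm{d}\,\frac{\partial}{\partial F}\log p(y) = -(y-\mu)(y-\mu)^T(\mathrm{d}\,F) - F^{-T}(\mathrm{d}\,F^T)F^{-T}$$

$$\frac{\partial^2}{\partial(\mathrm{vec}\,F)^2}\log p(y) = -I_d\otimes(y-\mu)(y-\mu)^T - \big(F^{-1}\otimes F^{-T}\big)K_d$$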
If we assume both $\Upsilon$ and $\mathrm{d}\,\Upsilon$ are always symmetric, we have

$$\frac{\partial^2}{\partial(\mathrm{vec}\,\Upsilon)^2}\log p(y) = -\frac{1}{2}\,\Sigma\otimes\Sigma$$